Summary

NVIDIA announced the Nemotron-4 340B model family, consisting of Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward, released as open-access models for research and commercial applications. Google Research introduced Human I/O, a system that uses large language models to detect situational impairments. Lamini announced a new memory tuning method that raises factual accuracy to 95%. Also covered are HelpSteer2, an open-source preference dataset for training reward models (Zhilin Wang et al.), a paper arguing that ChatGPT's falsehoods are best understood as “bullshit” (Michael Townsen Hicks et al.), and an introduction to multi-agent systems with CrewAI (Mariya Mansurova).

Nemotron-4 340B,

Link, June 14, 2024,
NVIDIA

  • Announced the Nemotron-4 340B model family (Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, Nemotron-4-340B-Reward)
  • Provided as open access under the NVIDIA Open Model License Agreement
  • Sized to be deployed in FP8 precision on a single DGX H100 with 8 GPUs
  • Over 98% of the data used in the model alignment process was synthetically generated
  • The synthetic data generation pipeline is also open-sourced to support model development

Human I/O: Detecting situational impairments with large language models,

Link, June 14, 2024,
Google Research

  • Introduces Human I/O, a system for detecting situationally induced impairments and disabilities (SIIDs)
  • Uses egocentric vision, multimodal sensing, and LLM reasoning
  • Assesses users' ability to interact with 82% accuracy across 60 in-the-wild recordings spanning 32 real-world scenarios
  • Predicts the availability of the user's vision, hearing, speech, and hands so that devices can adapt

Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations,

Link, June 14, 2024,
Lamini

  • Announced a new memory tuning method
  • Improves factual accuracy to 95% and cuts hallucinations 10x, from 50% to 5%
  • Inspired by information retrieval: only the relevant experts are selected from an index at inference time
  • In a Fortune 500 customer case, a text-to-SQL agent's accuracy rose from 50% to 95%

HelpSteer2: Open-source dataset for training top-performing reward models,

Link, June 12, 2024,
Zhilin Wang et al.

  • Released the HelpSteer2 dataset under a CC-BY-4.0 license
  • Consists of 10,000 response pairs, an order of magnitude fewer than existing preference datasets
  • Achieved the SOTA score (92.0%) on Reward-Bench's primary dataset
  • Proposes SteerLM 2.0, a model alignment method that leverages the reward model's multi-attribute scores

ChatGPT is bullshit,

Link, June 14, 2024,
Michael Townsen Hicks et al.

  • Argues that large language models are best described as “bullshit” in the sense that they are indifferent to the truth of their outputs
  • Contends that this framing is a more useful and accurate way of predicting and discussing AI misrepresentations

Multi AI Agent Systems 101,

Link, June 17, 2024,
Mariya Mansurova

  • Gives an overview of multi AI agent systems
  • Introduces frameworks such as AutoGen, LangGraph, and CrewAI
  • Walks through an example of automating data source management tasks with CrewAI
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each item with detailed points, and write a report. The report format is:

(today’s date in 년 월 일 format) AI News,

Summary

(overall short summary with good details; for the Summary section, explain the details starting with the company name, e.g. “OpenAI announced ~~~.”)

Title,

Korean title

Link, date,
company name

  • detailed summary 1, (use a concise, point-form style)
  • detailed summary 2, (use a concise, point-form style)
  • detailed summary N, (use a concise, point-form style)

Title,

Korean title

Link, date,
company name

  • detailed summary 1, (use a concise, point-form style)
  • detailed summary 2, (use a concise, point-form style)
  • detailed summary N, (use a concise, point-form style)
    ###
    https://research.nvidia.com/publication/2024-06_nemotron-4-340b

    Nemotron-4 340B

    We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows the distribution, modification, and use of the models and their outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.

    Publication Date
    Friday, June 14, 2024
    Research Area
    Artificial Intelligence and Machine Learning
    Generative AI

    Model Overview
    Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.

    The base model was pre-trained on a corpus of 9 trillion tokens consisting of a diverse assortment of English based texts, 50+ natural languages, and 40+ coding languages. Subsequently the Nemotron-4-340B-Instruct model went through additional alignment steps including:

    Supervised Fine-tuning (SFT)
    Direct Preference Optimization (DPO)
    Reward-aware Preference Optimization (RPO) (Additional in-house alignment technique)
    Throughout the alignment process, we relied on only approximately 20K human-annotated data while our data generation pipeline synthesized over 98% of the data used for supervised fine-tuning and preference fine-tuning (DPO & RPO). We provide comprehensive details about our synthetic data generation pipeline in the technical report.

    This results in a model that is aligned for human chat preferences, improvements in mathematical reasoning, coding and instruction-following, and is capable of generating high quality synthetic data for a variety of use cases.

    Under the NVIDIA Open Model License, NVIDIA confirms:

    Models are commercially usable.
    You are free to create and distribute Derivative Models.
    NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.
    License:
    NVIDIA Open Model License

    Intended use
    Nemotron-4-340B-Instruct is a chat model intended for use for the English language.

    Nemotron-4-340B-Instruct is designed for synthetic data generation, enabling developers and enterprises to build and customize their own large language models and LLM applications.

    The instruct model itself can be further customized using the NeMo Framework suite of customization tools including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA, and more), and Model Alignment (SFT, SteerLM, RLHF, and more) using NeMo-Aligner.

    Model Developer: NVIDIA

    Model Dates: Nemotron-4-340B-Instruct was trained between December 2023 and May 2024.

    Data Freshness: The pretraining data has a cutoff of June 2023.

    Required Hardware
    BF16 Inference:

    8x H200 (1x H200 node)
    16x H100 (2x H100 nodes)
    16x A100 80GB (2x A100 80GB nodes)
    Model Architecture:
    Nemotron-4-340B-Instruct is a standard decoder-only Transformer, trained with a sequence length of 4096 tokens, and uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).

    Architecture Type: Transformer Decoder (auto-regressive language model)

    Network Architecture: Nemotron-4

    ###
    https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
    NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models
    Nemotron-4 340B, a family of models optimized for NVIDIA NeMo and NVIDIA TensorRT-LLM, includes cutting-edge instruct and reward models, and a dataset for generative AI training.
    June 14, 2024 by Ankit Patel

    NVIDIA today announced Nemotron-4 340B, a family of open models that developers can use to generate synthetic data for training large language models (LLMs) for commercial applications across healthcare, finance, manufacturing, retail and every other industry.

    High-quality training data plays a critical role in the performance, accuracy and quality of responses from a custom LLM — but robust datasets can be prohibitively expensive and difficult to access.

    Through a uniquely permissive open model license, Nemotron-4 340B gives developers a free, scalable way to generate synthetic data that can help build powerful LLMs.

    The Nemotron-4 340B family includes base, instruct and reward models that form a pipeline to generate synthetic data used for training and refining LLMs. The models are optimized to work with NVIDIA NeMo, an open-source framework for end-to-end model training, including data curation, customization and evaluation. They’re also optimized for inference with the open-source NVIDIA TensorRT-LLM library.

    Nemotron-4 340B can be downloaded now from Hugging Face. Developers will soon be able to access the models at ai.nvidia.com, where they’ll be packaged as an NVIDIA NIM microservice with a standard application programming interface that can be deployed anywhere.
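    As a rough illustration of the download step, the sketch below pulls the instruct checkpoint with the huggingface_hub client; the repository id and local directory are assumptions based on NVIDIA's naming, not taken from the announcement.

    from huggingface_hub import snapshot_download

    # Assumed repo id following NVIDIA's naming on Hugging Face; adjust if it differs.
    local_dir = snapshot_download(
        repo_id="nvidia/Nemotron-4-340B-Instruct",
        local_dir="./nemotron-4-340b-instruct",
    )
    print("Model files downloaded to", local_dir)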

    Navigating Nemotron to Generate Synthetic Data
    LLMs can help developers generate synthetic training data in scenarios where access to large, diverse labeled datasets is limited.

    The Nemotron-4 340B Instruct model creates diverse synthetic data that mimics the characteristics of real-world data, helping improve data quality to increase the performance and robustness of custom LLMs across various domains.

    Then, to boost the quality of the AI-generated data, developers can use the Nemotron-4 340B Reward model to filter for high-quality responses. Nemotron-4 340B Reward grades responses on five attributes: helpfulness, correctness, coherence, complexity and verbosity. It’s currently first place on the Hugging Face RewardBench leaderboard, created by AI2, for evaluating the capabilities, safety and pitfalls of reward models.

    [Figure: Nemotron synthetic data generation pipeline] In this synthetic data generation pipeline, (1) the Nemotron-4 340B Instruct model is first used to produce synthetic text-based output. An evaluator model, (2) Nemotron-4 340B Reward, then assesses this generated text — providing feedback that guides iterative improvements and ensures the synthetic data is accurate, relevant and aligned with specific requirements.
    Researchers can also create their own instruct or reward models by customizing the Nemotron-4 340B Base model using their proprietary data, combined with the included HelpSteer2 dataset.
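    To make the two-step pipeline above concrete, here is a conceptual sketch in Python; the instruct_generate and reward_score callables are hypothetical stand-ins for the instruct and reward models (not NVIDIA's actual API), and the filtering threshold is arbitrary.

    ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

    def generate_synthetic_pairs(prompts, instruct_generate, reward_score, threshold=3.5):
        """(1) draft responses with the instruct model, (2) score them with the reward
        model on the five attributes, and keep only the high-quality pairs."""
        kept = []
        for prompt in prompts:
            response = instruct_generate(prompt)      # Nemotron-4 340B Instruct (stand-in)
            scores = reward_score(prompt, response)   # Nemotron-4 340B Reward (stand-in)
            if scores["helpfulness"] >= threshold and scores["correctness"] >= threshold:
                kept.append({"prompt": prompt, "response": response, "scores": scores})
        return kept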

    Fine-Tuning With NeMo, Optimizing for Inference With TensorRT-LLM
    Using open-source NVIDIA NeMo and NVIDIA TensorRT-LLM, developers can optimize the efficiency of their instruct and reward models to generate synthetic data and to score responses.

    All Nemotron-4 340B models are optimized with TensorRT-LLM to take advantage of tensor parallelism, a type of model parallelism in which individual weight matrices are split across multiple GPUs and servers, enabling efficient inference at scale.

    Nemotron-4 340B Base, trained on 9 trillion tokens, can be customized using the NeMo framework to adapt to specific use cases or domains. This fine-tuning process benefits from extensive pretraining data and yields more accurate outputs for specific downstream tasks.

    A variety of customization methods are available through the NeMo framework, including supervised fine-tuning and parameter-efficient fine-tuning methods such as low-rank adaptation, or LoRA.

    To boost model quality, developers can align their models with NeMo Aligner and datasets annotated by Nemotron-4 340B Reward. Alignment is a key step in training LLMs, where a model’s behavior is fine-tuned using algorithms like reinforcement learning from human feedback (RLHF) to ensure its outputs are safe, accurate, contextually appropriate and consistent with its intended goals.

    Businesses seeking enterprise-grade support and security for production environments can also access NeMo and TensorRT-LLM through the cloud-native NVIDIA AI Enterprise software platform, which provides accelerated and efficient runtimes for generative AI foundation models.

    Evaluating Model Security and Getting Started
    The Nemotron-4 340B Instruct model underwent extensive safety evaluation, including adversarial tests, and performed well across a wide range of risk indicators. Users should still perform careful evaluation of the model’s outputs to ensure the synthetically generated data is suitable, safe and accurate for their use case.

    For more information on model security and safety evaluation, read the model card.

    Download Nemotron-4 340B models via Hugging Face. For more details, read the research papers on the model and dataset.

    See notice regarding software product information.

    ###
    https://arxiv.org/abs/2406.08673
    HelpSteer2: Open-source dataset for training top-performing reward models
    Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev
    High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer need to be updated to remain effective for reward modeling. Methods that distil preference data from proprietary LLMs such as GPT-4 have restrictions on commercial usage imposed by model providers. To improve upon both generated responses and attribute labeling quality, we release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0). Using a powerful internal base model trained on HelpSteer2, we are able to achieve the SOTA score (92.0%) on Reward-Bench's primary dataset, outperforming currently listed open and proprietary models, as of June 12th, 2024. Notably, HelpSteer2 consists of only ten thousand response pairs, an order of magnitude fewer than existing preference datasets (e.g., HH-RLHF), which makes it highly efficient for training reward models. Our extensive experiments demonstrate that reward models trained with HelpSteer2 are effective in aligning LLMs. In particular, we propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute score predicted by our reward models. HelpSteer2 is available at this https URL and code is available at this https URL
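    For readers who want to inspect the data, a minimal sketch follows; it assumes the dataset is published on Hugging Face under the id nvidia/HelpSteer2 with per-attribute annotation columns.

    from datasets import load_dataset

    ds = load_dataset("nvidia/HelpSteer2", split="train")  # assumed dataset id
    example = ds[0]
    for attr in ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]:
        print(attr, example[attr])  # each response is annotated per attribute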

    ###
    https://research.google/blog/human-io-detecting-situational-impairments-with-large-language-models/
    Human I/O: Detecting situational impairments with large language models
    June 14, 2024
    Google Research

    Xingyu Bruce Liu, Student Researcher, and Ruofei Du, Interactive Perception & Graphics Lead, Google Augmented Reality

    Human I/O is a unified approach that uses egocentric vision, multimodal sensing, and LLM reasoning to detect situational impairments and assess a user’s ability to interact with their hands, vision, hearing or speech in a given situation.

    Every day, we encounter temporary challenges that can affect our abilities to respond to different situations. These challenges, known as situationally induced impairments and disabilities (SIIDs), can be caused by various environmental factors like noise, lighting, temperature, stress, and even social norms. For example, imagine you're in a loud restaurant and you miss an important phone call because you simply could not hear your phone ring. Or picture yourself trying to respond to a text message while washing dishes; your wet hands and the task at hand make it hard to type a reply. These everyday scenarios show how our surroundings can momentarily reduce our physical, cognitive, or emotional abilities, leading to frustrating experiences.

    In addition, situational impairments can vary greatly and change frequently, which makes it difficult to apply one-size-fits-all solutions that help users with their needs in real-time. For example, think about a typical morning routine: while brushing their teeth, someone might not be able to use voice commands with their smart devices. When washing their face, it could be hard to see and respond to important text messages. And while using a hairdryer, it might be difficult to hear any phone notifications. Even though various efforts have created solutions tailored for specific situations like these, creating manual solutions for every possible situation and combination of challenges isn't really feasible and doesn't work well on a large scale.

    In “Human I/O: Towards a Unified Approach to Detecting Situational Impairments”, which received a Best Paper Honorable Mention Award at CHI 2024, we introduce a generalizable and extensible framework for detecting SIIDs. Rather than devising individual models for activities like face-washing, tooth-brushing, or hair-drying, Human Input/Output (Human I/O) universally assesses the availability of a user’s vision (e.g., to read text messages, watch videos), hearing (e.g., to hear notifications, phone calls), vocal (e.g., to have a conversation, use Google Assistant), and hand (e.g., to use touch screen, gesture control) input/output interaction channels. We describe how Human I/O leverages egocentric vision, multimodal sensing, and reasoning with large language models (LLMs) to achieve an 82% accuracy in availability prediction across 60 in-the-wild egocentric video recordings in 32 different scenarios, and validate it as an interactive system in a lab study with ten participants. We also open-sourced the code.

    [Figure 1] Leveraging multi-modal AI and large language models, we built a pipeline to computationally model these human input/output channels and achieved good real-time performance.

    Formative study
    Prior to developing Human I/O, we conducted a formative study involving ten participants to better understand how different impairments affected their interaction with technology. Our findings highlighted the need for a system that could dynamically adapt to the varying levels of channel availability rather than treating availability as a binary state. Specifically, Human I/O distinguishes between four levels of channel availability, which are crucial for understanding the degree to which a user can engage with their device. These levels include:

    Available
    The channel is currently not involved in any activity, or constrained by any environmental factors. It takes low to zero effort to use the channel to complete a new task.
    Example: A user is sitting at their desk with their hands free, eyes not engaged in a task, and no background noise interfering with their hearing or speech.
    Slightly affected
    The channel is engaged in an activity or constrained by an environmental factor. Given a new task that requires the channel, users can multitask, easily pause and resume the current activity, or easily overcome the situation.
    Example: A user is holding a remote control, which can be set aside to free up their hand for another task.
    Affected
    The channel is involved in an activity or constrained by an environmental factor. Given a new task, the user may experience inconvenience or require some effort to use the channel.
    Example: A user is using both hands to carry groceries, making it challenging to use their hands for other tasks without putting the bags down first.
    Unavailable
    The channel is completely unavailable due to an activity or environmental factor, and the user cannot use it for a new task without substantial changes, significant adaptation or changing the environment.
    Example: A user is attending a loud concert, making it impossible for them to hear incoming notifications or have a conversation.
    Human I/O system pipeline
    The Human I/O system incorporates a pipeline that streams real-time data, processes it to understand the context, and applies reasoning with LLMs to predict channel availability. Here’s a deeper look into the three core components of the system: data streaming, processing module, and reasoning module.

    Data streaming
    The system begins by streaming real-time video and audio data through an egocentric device that has a camera and microphone. This setup provides a first-person view of the user's environment, capturing both visual and auditory details necessary for assessing the context.

    Processing module
    This module processes the raw data to extract the following information:

    Activity recognition: Utilizes computer vision to identify the user's current activities, such as cooking or washing dishes.
    Environment assessment: Determines the setting, such as noisy or quiet environments, using both audio and visual data.
    Direct sensing: Detects finer details, such as whether the user’s hands are occupied, ambient noise levels, and lighting conditions.
    Reasoning module
    The final stage involves analyzing the structured data from the processing module to predict the availability of input/output channels using an LLM with chain-of-thought reasoning. This module processes contextual information and determines how impaired each channel is, guiding how the device should adapt its interactions accordingly. By integrating data streaming, processing, and reasoning, Human I/O dynamically predicts the availability of the user’s input and output channels. We further incorporate a smoothing algorithm for enhanced system stability.
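    The sketch below is an illustrative rendering of that reasoning step, not the paper's implementation: the structured outputs of the processing module are folded into a chain-of-thought prompt, and a hypothetical call_llm function returns a per-channel availability level.

    LEVELS = ["available", "slightly affected", "affected", "unavailable"]

    def predict_channel_availability(activity, environment, sensing, call_llm):
        prompt = (
            "You assess situationally induced impairments.\n"
            f"Current activity: {activity}\n"
            f"Environment: {environment}\n"
            f"Direct sensing: {sensing}\n"
            "Think step by step, then rate the user's vision, hearing, vocal and hand "
            f"channels, each as one of: {', '.join(LEVELS)}.\n"
            "Answer as four lines, e.g. 'vision: available'."
        )
        return call_llm(prompt)  # parsed downstream into per-channel levels

    # Hypothetical usage:
    # predict_channel_availability("washing dishes", "kitchen, moderate noise",
    #                              "hands occupied and wet", call_llm=my_llm)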

    [Figure 2] The Human I/O pipeline comprises an egocentric device, processing modules that obtain activity, environment, and sensor data, and a reasoning module that predicts human channel availability.

    Evaluation
    To validate Human I/O, we conducted an evaluation with 300 clips selected from 60 in-the-wild egocentric video recordings. The system achieves a 0.22 mean absolute error (MAE) and an 82% accuracy in predicting channel availability, with 96% of predictions within one step of the actual availability level. These low MAE values indicate that our system’s predictions closely align with the actual availability, with deviations being less than a third of the actual level on average.
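    As a small worked example of these metrics (assuming availability levels are encoded as integers 0-3), the helper below computes the MAE, the exact accuracy, and the share of predictions within one step of the ground truth:

    import numpy as np

    def evaluate(pred, actual):
        diff = np.abs(np.asarray(pred) - np.asarray(actual))
        return {
            "mae": diff.mean(),                    # mean absolute error in levels
            "accuracy": (diff == 0).mean(),        # exact matches
            "within_one_step": (diff <= 1).mean(),
        }

    # evaluate([0, 2, 1, 3], [0, 1, 1, 3]) -> {'mae': 0.25, 'accuracy': 0.75, 'within_one_step': 1.0}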

    In addition, we conducted an ablation study by introducing Human I/O Lite, which replaces the chain-of-thought reasoning module with a one-shot prompt. For Human I/O Lite, we observe overall slightly inferior performance compared to the full model. However, the MAE for Human I/O Lite is still at a low level around 0.44, showing a promising ability to predict SIIDs even with reduced computational resources.

    [Figure 3] Technical evaluation of Human I/O and Human I/O Lite. We report the MAE, average classification accuracy (ACC), and average intra-video variance (VAR) for four channels and the overall results. Our system estimates availability levels with small margins of error and variance. In Human I/O, 96.0% of predictions are within a discrepancy of 1 step from the actual availability value.

    Additionally, a user study with ten participants showed that Human I/O significantly reduced effort and enhanced user experience in the presence of SIIDs. Participants particularly valued how the system adapted to their real-time context, making digital interactions more seamless and less disruptive. Participants completed the NASA Task Load Index questionnaire, assessing mental demand, physical demand, temporal demand, overall performance, effort, and frustration level on a 7-point scale (from 1–Lowest to 7–Highest). Results suggest that Human I/O significantly reduces effort and improves user experience in the presence of SIIDs. Furthermore, Human I/O users reported a raised awareness of SIIDs that opened up new interaction possibilities.

    [Figure 4] Participants’ ratings for Task Load Index questions (on a scale of 1–low to 7–high) for their experience with SIIDs with and without Human I/O in the user study. All rating differences are statistically significant with 𝑝 < 0.001 via Wilcoxon signed-rank tests.

    Conclusion & future work
    Human I/O represents a leap forward in our ability to interact with technology in a context-aware and adaptive manner. By understanding and predicting the availability of our input and output channels, it paves the way for smarter, more intuitive user interfaces that can enhance productivity and accessibility for everyone, regardless of the situational challenges they face.

    Meanwhile, maintaining privacy and upholding ethical standards are crucial in the design and deployment of SIIDs systems with active cameras and microphones on wearables. For example, a combination of on-device inference (e.g., Gemini Nano) and federated learning can forestall potential data breaches. Future research may incorporate more sensing techniques, such as depth sensing, ultra-wideband, and eye tracking, to provide users with finer controls over how their devices adapt to their changing needs and situations.

    We envision this technology not only improving individual device interactions but also serving as a foundation for future developments in ubiquitous computing.

    ###
    https://www.lamini.ai/blog/lamini-memory-tuning
    Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations
    TLDR:
    Lamini Memory Tuning is a new way to embed facts into LLMs that improves factual accuracy and reduces hallucinations to previously unachievable levels — for one Fortune 500 customer, Lamini Memory Tuning led to 95% accuracy compared to 50% with other approaches. Hallucinations were reduced from 50% to 5%.
    Lamini Memory Tuning is a research breakthrough that overcomes a seeming paradox in the AI world: achieving precise factual accuracy (i.e. no hallucinations) while upholding the generalization capabilities that make LLMs valuable in the first place.
    The method entails tuning millions of expert adapters (e.g. LoRAs) with precise facts on top of any open-source LLM, like Llama 3 or Mistral 3. If the goal is to get Roman Empire facts exactly right, Lamini Memory Tuning would create experts on Caesar, aqueducts, legions, and any other facts you provide. Inspired by information retrieval, the model retrieves only the most relevant experts from an index at inference time — not all the model weights — so latency and cost are dramatically lower. High accuracy, high speed, low cost: with Lamini Memory Tuning, you don’t have to choose.
    Contact us to try Lamini Memory Tuning.


    Accuracy matters immensely
    Yet, general-purpose LLMs are designed to hallucinate, because they are trained to reduce the average error across the examples they’ve seen. They’re pretty good at everything, but perfect at nothing. They can produce fluent English prose because they’ve seen so much of it across the internet, but specific facts—like a date, a revenue number, or a variable name—get muddled in probabilities. As a result, companies have not been able to count on LLMs for the most critical and most valuable use cases – until now.

    Introducing Lamini Memory Tuning
    Lamini Memory Tuning is a completely new way to fine-tune any existing LLM by tuning millions of LoRA adapters and selecting across them in a wide Mixture of Experts at inference time.
    Instead of optimizing average error on everything, Lamini Memory Tuning optimizes for zero error on the specific facts you tell it to remember, so it recalls those facts nearly perfectly. That’s not special on its own. This approach is particularly groundbreaking because it preserves the LLM’s ability to generalize with average error on everything else, and thus continue to produce fluent prose around those facts. Lamini Memory Tuning is a systematic tool for eliminating hallucinations on the facts you care about.

    Fortune 500 customers are already using Lamini Memory Tuning to achieve 95% factual accuracy on critical use cases where previous state-of-the-art approaches peaked at 50%.



    THE PROBLEM

    Prompting and RAG: necessary but not sufficient
    Prompting and Retrieval Augmented Generation (RAG) are important methods for surfacing relevant information to the model, shifting its probabilities to consider similar information. This is an important step to getting the model to condition on the right concepts and information, because the model has been trained on so many tasks. Good prompt-engineering and RAG pipelines are critical to improve the overall accuracy of the model.

    At times, this is all you need. But other times, you provide the relevant information and the response is still wrong but so close to right — leading to hallucinations.


    Why do hallucinations happen with the right data? In the model’s internal representation, the right answer is likely clustered with similar, but wrong, options. The right context increases the probabilities of the right answer and nearby wrong options. The model doesn’t know that a nearly right answer is still wrong, because general models don’t distinguish between exactly right and nearly right — they never learned to take the loss on those answers to zero. Prompting and RAG don’t change that.

    Lamini Memory Tuning addresses this directly, by combining methods from information retrieval and AI to teach the model that getting the answer nearly right is the same as getting it totally wrong.

    Instruction fine-tuning: the wrong tool for the job
    Many teams turn to instruction fine-tuning when other techniques hit a wall on factual accuracy, but instruction fine-tuning, with or without LoRAs, leads to the same issue that pre-training has: it gets to be pretty good at a more narrow dataset, but still perfect at nothing, while being finicky to work with (losing the ability to perform on some general tasks, if you do it wrong).

    As a result, teams struggle with unclear choices, long feedback loops, high compute bills, and ultimately underwhelming performance improvements. While instruction fine-tuning can be really valuable (it’s what turned GPT-3 into ChatGPT), it doesn't make models perfect at the facts that matter. In other words, traditional fine-tuning does not ensure that the model's answers are faithful to facts in its training data.

    This is why we developed Lamini Memory Tuning.



    OUR INNOVATION

    Lamini Memory Tuning: near-perfect fact recall via 1 million-way MoE
    Lamini Memory Tuning is a fundamentally different fine-tuning approach that effectively teaches any open-source LLM to be near-perfect on facts, while still maintaining its ability to be pretty good at everything else. When the model is supposed to recall a specific fact, Lamini Memory Tuning shifts the entire probability mass to that particular fact (i.e. specific tokens within a particular context), such as the exact SQL schema for your database. This results in output probabilities that are not just closer to the right result, but exactly there.


    To do this, Lamini Memory Tuning tunes a massive mixture of memory experts on any open-source LLM. Each memory expert acts like a LoRA adapter that functionally operates as memory for the model. Together, the memory experts specialize in a million different ways to ensure faithful and factual accuracy to the data that it was tuned on. Inspired by information retrieval, these million memory experts are equivalent to indices from which the model intelligently retrieves and routes. At inference time, the model retrieves the most relevant experts at each layer and merges back into the base model to respond to the user query.


    The result is a sparsely activated model, called a Mixture of Memory Experts (MoME), that can scale to an enormous number of parameters at a fixed computational inference cost. This means MoMEs have extremely high capacity for the number of facts that can be learned, bounded only by the total size of the training data set. Llama 3 was trained on 15 trillion tokens. Realistically, you will run out of system memory before you run out of memory capacity in a MoME.
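    The pseudocode below sketches the retrieval idea described above; the expert_index, embed and merge_adapters objects are purely conceptual placeholders and do not reflect Lamini's implementation.

    def answer_with_memory_experts(query, base_model, expert_index, embed, k=8):
        # Look up the most relevant memory experts, as in an information-retrieval index.
        experts = expert_index.search(embed(query), top_k=k)
        # Merge only the selected LoRA-style adapters into the base model for this request,
        # so the compute cost stays roughly constant regardless of how many experts exist.
        model = base_model.merge_adapters(experts)
        return model.generate(query)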

    Ultimately, this approach makes what were impossible use cases that critically suffer from hallucinations within reach, and drastically improves LLM time-to-accuracy and thus time-to-market.

    Read more details in our research paper.

    Results
    Lamini Memory Tuning has been a game-changing capability with Lamini’s Fortune 500 clients, who are deploying it for the following use cases:

    High precision text-to-SQL
    Client need: Democratize data access by using LLMs to turn natural language questions into database queries.
    Challenge: The relevant databases had unique internal names and large, messy schemas.
    Result: We achieved 95% accuracy with Lamini Memory Tuning after 50% accuracy with RAG.
    High precision classification
    Client need: Save thousands of hours by automatically labeling data accurately.
    Challenge: We had to adhere to an exact taxonomy of 900 categories.
    Result: We achieved 100% accuracy across thousands of documents.
    High precision recommendations
    Client need: Increase cart size and revenue with AI-powered product suggestions.
    Challenge: Applications break when product IDs are hallucinated.
    Result: We achieved 88% accuracy across a 50,000 product database.
    A new frontier

    Lamini Memory Tuning changes several of the fundamental dynamics and tradeoffs governing how we work with LLMs. We’re in the early days of this new paradigm, and we’re still learning alongside our customers what’s possible. Summarizing a few areas we’re most excited about:

    Higher accuracy enables full automation as opposed to copiloting.
    Lower costs let you take your product from internal demos to a wider production audience.
    Lower latency enables seamless user experiences.
    Smaller models mean faster development and improvement cycles.
    What could you do with models that ran faster, were more accurate, and cost less to develop and run?


    Start using Lamini Memory Tuning
    Because Lamini Memory Tuning is a cutting-edge technique that embeds your unique data in a new model architecture, we’re exclusively working with select partners.

    Contact us to try Lamini Memory Tuning.

    Want to learn more?

    Read the research paper for details on the underlying approach.
    Read the case study to see how a Fortune 500 company is using Lamini Memory Tuning for a 95% accurate text-to-SQL agent.

    ###
    https://link.springer.com/content/pdf/10.1007/s10676-024-09775-5.pdf
    ChatGPT is bullshit
    Michael Townsen Hicks, James Humphries, Joe Slater

    Abstract
    Recently, there has been considerable interest in large language models: machine learning systems which produce humanlike text and dialogue. Applications of these systems have been plagued by persistent inaccuracies in their output; these are often called “AI hallucinations”. We argue that these falsehoods, and the overall activity of large language models, is better understood as bullshit in the sense explored by Frankfurt (On Bullshit, Princeton, 2005): the models are in an important way indifferent to the truth of their outputs. We distinguish two ways in which the models can be said to be bullshitters, and argue that they clearly meet at least one of these definitions. We further argue that describing AI misrepresentations as bullshit is both a more useful and more accurate way of predicting and discussing the behaviour of these systems.

    ###
    https://towardsdatascience.com/multi-ai-agent-systems-101-bac58e3bcc47
    Multi AI Agent Systems 101
    Automating Routine Tasks in Data Source Management with CrewAI
    Mariya Mansurova
    Towards Data Science

    Initially, when ChatGPT just appeared, we used simple prompts to get answers to our questions. Then, we encountered issues with hallucinations and began using RAG (Retrieval Augmented Generation) to provide more context to LLMs. After that, we started experimenting with AI agents, where LLMs act as a reasoning engine and can decide what to do next, which tools to use, and when to return the final answer.

    The next evolutionary step is to create teams of such agents that can collaborate with each other. This approach is logical as it mirrors human interactions. We work in teams where each member has a specific role:

    The product manager proposes the next project to work on.
    The designer creates its look and feel.
    The software engineer develops the solution.
    The analyst examines the data to ensure it performs as expected and identifies ways to improve the product for customers.
    Similarly, we can create a team of AI agents, each focusing on one domain. They can collaborate and reach a final conclusion together. Just as specialization enhances performance in real life, it could also benefit the performance of AI agents.

    Another advantage of this approach is increased flexibility. Each agent can operate with its own prompt, set of tools and even LLM. For instance, we can use different models for different parts of our system. You can use GPT-4 for the agent that needs more reasoning and GPT-3.5 for the one that does only simple extraction. We can even fine-tune the model for small specific tasks and use it in our crew of agents.

    The potential drawbacks of this approach are time and cost. Multiple interactions and knowledge sharing between agents require more calls to LLM and consume additional tokens. This could result in longer wait times and increased expenses.

    There are several frameworks available for multi-agent systems today.
    Here are some of the most popular ones:

    AutoGen: Developed by Microsoft, AutoGen uses a conversational approach and was one of the earliest frameworks for multi-agent systems,
    LangGraph: While not strictly a multi-agent framework, LangGraph allows for defining complex interactions between actors using a graph structure. So, it can also be adapted to create multi-agent systems.
    CrewAI: Positioned as a high-level framework, CrewAI facilitates the creation of “crews” consisting of role-playing agents capable of collaborating in various ways.
    I’ve decided to start my experiments with multi-agent frameworks using CrewAI, since it’s quite popular and user-friendly. So, it looks like a good option to begin with.

    In this article, I will walk you through how to use CrewAI. As analysts, we’re the domain experts responsible for documenting various data sources and addressing related questions. We’ll explore how to automate these tasks using multi-agent frameworks.

    Setting up the environment
    Let’s start with setting up the environment. First, we need to install the CrewAI main package and an extension to work with tools.

    pip install crewai
    pip install 'crewai[tools]'
    CrewAI was developed to work primarily with OpenAI API, but I would also like to try it with a local model. According to the ChatBot Arena Leaderboard, the best model you can run on your laptop is Llama 3 (8b parameters). It will be the most feasible option for our use case.

    We can access Llama models using Ollama. Installation is pretty straightforward. You need to download Ollama from the website and then go through the installation process. That’s it.

    Now, you can test the model in CLI by running the following command.

    ollama run llama3
    For example, you can ask something like this.


    Let’s create a custom Ollama model to use later in CrewAI.

    We will start with a ModelFile (documentation). I only specified the base model (llama3), temperature and stop sequence. However, you might add more features. For example, you can define the system message using the SYSTEM keyword.

    FROM llama3

    # set parameters
    PARAMETER temperature 0.5
    PARAMETER stop Result
    I’ve saved it into a Llama3ModelFile file.

    Let’s create a bash script to load the base model for Ollama and create the custom model we defined in ModelFile.

    #!/bin/zsh

    # define variables
    model_name="llama3"
    custom_model_name="crewai-llama3"

    # load the base model
    ollama pull $model_name

    # create the model file
    ollama create $custom_model_name -f ./Llama3ModelFile
    Let’s execute this file.

    chmod +x ./llama3_setup.sh
    ./llama3_setup.sh
    You can find both files on GitHub: Llama3ModelFile and llama3_setup.sh

    We need to initialise the following environment variables to use the local Llama model with CrewAI.

    import os

    os.environ["OPENAI_API_BASE"] = 'http://localhost:11434/v1'
    os.environ["OPENAI_MODEL_NAME"] = 'crewai-llama3'  # custom_model_name from the bash script
    os.environ["OPENAI_API_KEY"] = "NA"
    We’ve finished the setup and are ready to continue our journey.

    Use cases: working with documentation
    As analysts, we often play the role of subject matter experts for data and some data-related tools. In my previous team, we used to have a channel with almost 1K participants, where we were answering lots of questions about our data and the ClickHouse database we used as storage. It took us quite a lot of time to manage this channel. It would be interesting to see whether such tasks can be automated with LLMs.

    For this example, I will use the ClickHouse database. If you’re interested, you can learn more about ClickHouse and how to set it up locally in my previous article. However, we won’t utilise any ClickHouse-specific features, so feel free to stick to the database you know.

    I’ve created a pretty simple data model to work with. There are just two tables in our DWH (Data Warehouse): ecommerce_db.users and ecommerce_db.sessions. As you might guess, the first table contains information about the users of our service.


    The ecommerce_db.sessions table stores information about user sessions.


    Regarding data source management, analysts typically handle tasks like writing and updating documentation and answering questions about this data. So, we will use LLM to write documentation for the table in the database and teach it to answer questions about data or ClickHouse.

    But before moving on to the implementation, let’s learn more about the CrewAI framework and its core concepts.

    CrewAI basic concepts
    The cornerstone of a multi-agent framework is the agent concept. In CrewAI, agents are powered by role-playing. Role-playing is a tactic where you ask an agent to adopt a persona and behave like a top-notch backend engineer or a helpful customer support agent. So, when creating a CrewAI agent, you need to specify each agent's role, goal, and backstory so that the LLM knows enough to play this role.

    The agents’ capabilities are limited without tools (functions that agents can execute and get results). With CrewAI, you can use one of the predefined tools (for example, to search the Internet, parse a website, or do RAG on a document), create a custom tool yourself or use LangChain tools. So, it’s pretty easy to create a powerful agent.

    Let’s move on from agents to the work they are doing. Agents are working on tasks (specific assignments). For each task, we need to define a description, expected output (definition of done), set of available tools and assigned agent. I really like that these frameworks follow the managerial best practices like a clear definition of done for the tasks.

    The next question is how to define the execution order for tasks: which one to work on first, which ones can run in parallel, etc. CrewAI implemented processes to orchestrate the tasks. It provides a couple of options:

    Sequential — the most straightforward approach, when tasks are called one after another.
    Hierarchical — when there’s a manager (specified as LLM model) that creates and delegates tasks to the agents.
    Also, CrewAI is working on a consensual process. In such a process, agents will be able to make decisions collaboratively with a democratic approach.

    There are other levers you can use to tweak the process of tasks’ execution:

    You can mark tasks as “asynchronous”, then they will be executed in parallel, so you will be able to get an answer faster.
    You can use the “human input” flag on a task, and then the agent will ask for human approval before finalising the output of this task. It can allow you to add an oversight to the process.
    We’ve defined all the primary building blocks and can discuss the holy grail of CrewAI — the crew concept. The crew represents the team of agents and the set of tasks they will be working on. The approach for collaboration (the processes we discussed above) can also be defined at the crew level.

    Also, we can set up the memory for a crew. Memory is crucial for efficient collaboration between the agents. CrewAI supports three levels of memory:

    Short-term memory stores information related to the current execution. It helps agents to work together on the current task.
    Long-term memory is data about the previous executions stored in the local database. This type of memory allows agents to learn from earlier iterations and improve over time.
    Entity memory captures and structures information about entities (like personas, cities, etc.)
    Right now, you can only switch on all types of memory for a crew without any further customisation. However, it doesn’t work with the Llama models.
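    A minimal sketch of these crew-level levers, assuming the current CrewAI API; the reviewer agent and task here are placeholders rather than part of the article's example.

    from crewai import Agent, Task, Crew, Process

    reviewer = Agent(
        role = "Reviewer",
        goal = "Check documentation for factual accuracy",
        backstory = "You are a meticulous editor.",
        allow_delegation = False
    )

    review_task = Task(
        description = "Review the draft documentation for factual accuracy",
        expected_output = "A short list of corrections",
        agent = reviewer,
        human_input = True,       # pause for human approval before finalising the output
        async_execution = False   # set True to run this task in parallel with others
    )

    crew = Crew(
        agents = [reviewer],
        tasks = [review_task],
        process = Process.sequential,   # or Process.hierarchical with a manager LLM
        memory = True                   # switches on short-term, long-term and entity memory
    )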

    We’ve learned enough about the CrewAI framework, so it’s time to start using this knowledge in practice.

    Use case: writing documentation
    Let’s start with a simple task: putting together the documentation for our DWH. As we discussed before, there are two tables in our DWH, and I would like to create a detailed description for them using LLMs.

    First approach
    In the beginning, we need to think about the team structure. Think of this as a typical managerial task. Who would you hire for such a job?

    I would break this task into two parts: retrieving data from a database and writing documentation. So, we need a database specialist and a technical writer. The database specialist needs access to a database, while the writer won’t need any special tools.


    Now, we have a high-level plan. Let’s create the agents.

    For each agent, I’ve specified the role, goal and backstory. I’ve tried my best to provide agents with all the needed context.

    database_specialist_agent = Agent(
        role = "Database specialist",
        goal = "Provide data to answer business questions using SQL",
        backstory = '''You are an expert in SQL, so you can help the team
        to gather needed data to power their decisions.
        You are very accurate and take into account all the nuances in data.''',
        allow_delegation = False,
        verbose = True
    )

    tech_writer_agent = Agent(
        role = "Technical writer",
        goal = '''Write engaging and factually accurate technical documentation
        for data sources or tools''',
        backstory = '''You are an expert in both technology and communications,
        so you can easily explain even sophisticated concepts.
        You base your work on the factual information provided by your colleagues.
        Your texts are concise and can be easily understood by a wide audience.
        You use a professional but rather informal style in your communication.''',
        allow_delegation = False,
        verbose = True
    )
    We will use a simple sequential process, so there’s no need for agents to delegate tasks to each other. That’s why I specified allow_delegation = False.

    The next step is setting the tasks for agents. But before moving to them, we need to create a custom tool to connect to the database.

    First, I put together a function to execute ClickHouse queries using HTTP API.

    import requests

    CH_HOST = 'http://localhost:8123' # default address

    def get_clickhouse_data(query, host = CH_HOST, connection_timeout = 1500):
        # send the query to ClickHouse over its HTTP interface
        r = requests.post(host, params = {'query': query},
                          timeout = connection_timeout)
        if r.status_code == 200:
            return r.text
        else:
            # return the error text so the LLM agent can try to fix the query
            return 'Database returned the following error:\n' + r.text
    When working with LLM agents, it’s important to make tools fault-tolerant. For example, if the database returns an error (status_code != 200), my code won’t throw an exception. Instead, it will return the error description to the LLM so it can attempt to resolve the issue.

    To create a CrewAI custom tool, we need to derive our class from crewai_tools.BaseTool, implement the _run method and then create an instance of this class.

    from crewai_tools import BaseTool

    class DatabaseQuery(BaseTool):
        name: str = "Database Query"
        description: str = "Returns the result of SQL query execution"

        def _run(self, sql_query: str) -> str:
            # Execute the query via the ClickHouse HTTP helper defined above
            return get_clickhouse_data(sql_query)

    database_query_tool = DatabaseQuery()
    Now, we can set the tasks for the agents. Again, providing clear instructions and all the context to LLM is crucial.

    table_description_task = Task(
        description = '''Provide the comprehensive overview for the data
        in table {table}, so that it's easy to understand the structure
        of the data. This task is crucial to put together the documentation
        for our database''',
        expected_output = '''The comprehensive overview of {table} in the md format.
        Include 2 sections: columns (list of columns with their types)
        and examples (the first 30 rows from table).''',
        tools = [database_query_tool],
        agent = database_specialist_agent
    )

    table_documentation_task = Task(
        description = '''Using provided information about the table,
        put together the detailed documentation for this table so that
        people can use it in practice''',
        expected_output = '''Well-written detailed documentation describing
        the data scheme for the table {table} in markdown format,
        that gives the table overview in 1-2 sentences and then
        describes each column. Structure the columns description
        as a markdown table with column name, type and description.''',
        tools = [],
        output_file = "table_documentation.md",
        agent = tech_writer_agent
    )
    You might have noticed that I’ve used the {table} placeholder in the tasks’ descriptions. We will pass table as an input variable when executing the crew, and this value will be inserted into all the placeholders.

    Also, I’ve specified the output file for the table documentation task to save the final result locally.

    We have all we need. Now, it’s time to create a crew and execute the process, specifying the table we are interested in. Let’s try it with the users table.

    crew = Crew(
        agents = [database_specialist_agent, tech_writer_agent],
        tasks = [table_description_task, table_documentation_task],
        verbose = 2
    )

    result = crew.kickoff({'table': 'ecommerce_db.users'})
    It’s an exciting moment, and I’m really looking forward to seeing the result. Don’t worry if execution takes some time. Agents make multiple LLM calls, so it’s perfectly normal for it to take a few minutes. It took 2.5 minutes on my laptop.

    We asked the LLM to return the documentation in markdown format. We can use the following code to see the formatted result in Jupyter Notebook.

    from IPython.display import Markdown
    Markdown(result)
    At first glance, it looks great. We’ve got a valid markdown file describing the users table.


    But wait, it’s incorrect. Let’s see what data we have in our table.


    The columns listed in the documentation are completely different from what we have in the database. It’s a case of LLM hallucinations.

    We’ve set verbose = 2 to get the detailed logs from CrewAI. Let’s read through the execution logs to identify the root cause of the problem.

    First, the database specialist couldn’t query the database due to complications with quotes.


    The specialist didn’t manage to resolve this problem, and the chain was eventually terminated by CrewAI with the following output: Agent stopped due to iteration limit or time limit.

    This means the technical writer didn’t receive any factual information about the data. However, the agent continued and produced completely fake results. That’s how we ended up with incorrect documentation.

    Fixing the issues
    Even though our first iteration wasn’t successful, we’ve learned a lot. We have (at least) two areas for improvement:

    Our database tool is too difficult for the model, and the agent struggles to use it. We can make the tool more tolerant by removing quotes from the beginning and end of the queries. This solution is not ideal since valid SQL can end with a quote, but let’s try it.
    Our technical writer isn’t basing its output on the input from the database specialist. We need to tweak the prompt to highlight the importance of providing only factual information.
    So, let’s try to fix these problems. First, we will fix the tool — we can leverage strip to eliminate quotes.

    CH_HOST = 'http://localhost:8123' # default address

    def get_clickhouse_data(query, host = CH_HOST, connection_timeout = 1500):
        r = requests.post(host, params = {'query': query.strip('"').strip("'")},
                          timeout = connection_timeout)
        if r.status_code == 200:
            return r.text
        else:
            return 'Database returned the following error:\n' + r.text
    Then, it’s time to update the prompt. I’ve included statements emphasizing the importance of sticking to the facts in both the agent and task definitions.


    tech_writer_agent = Agent(
        role = "Technical writer",
        goal = '''Write engaging and factually accurate technical documentation
        for data sources or tools''',
        backstory = '''You are an expert in both technology and communications,
        so you can easily explain even sophisticated concepts.
        Your texts are concise and can be easily understood by a wide audience.
        You use a professional but rather informal style in your communication.
        You base your work on the factual information provided by your colleagues.
        You stick to the facts in the documentation and use ONLY
        the information provided by your colleagues, without adding anything.''',
        allow_delegation = False,
        verbose = True
    )

    table_documentation_task = Task(
        description = '''Using provided information about the table,
        put together the detailed documentation for this table so that
        people can use it in practice''',
        expected_output = '''Well-written detailed documentation describing
        the data scheme for the table {table} in markdown format,
        that gives the table overview in 1-2 sentences and then
        describes each column. Structure the columns description
        as a markdown table with column name, type and description.
        The documentation is based ONLY on the information provided
        by the database specialist without any additions.''',
        tools = [],
        output_file = "table_documentation.md",
        agent = tech_writer_agent
    )
    Let’s execute our crew once again and see the results.


    We’ve achieved a somewhat better result. Our database specialist was able to execute queries and view the data, which is a significant win for us. Additionally, we can see all the relevant fields in the result table, though there are lots of other fields as well, so it’s still not entirely correct.

    I once again looked through the CrewAI execution log to figure out what went wrong. The issue lies in getting the list of columns. There’s no filter by database, so it returns some unrelated columns that appear in the result.

    SELECT column_name
    FROM information_schema.columns
    WHERE table_name = 'users'
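    A quick way to make this query more robust would be to filter by schema as well. Here’s a minimal sketch of what the specialist could run instead, assuming (as in recent ClickHouse versions) that information_schema.columns exposes a table_schema column:

    # A sketch: restricting the metadata query to our database
    get_clickhouse_data('''
        SELECT column_name
        FROM information_schema.columns
        WHERE table_schema = 'ecommerce_db' AND table_name = 'users'
    ''')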
    Also, after looking at multiple attempts, I noticed that the database specialist, from time to time, executes a select * from <table> query. It might cause some issues in production, as it could retrieve lots of data and send it to the LLM.

    More specialised tools
    We can provide our agent with more specialised tools to improve our solution. Currently, the agent has a tool to execute any SQL query, which is flexible and powerful but prone to errors. We can create more focused tools, such as getting table structure and top-N rows from the table. Hopefully, it will reduce the number of mistakes.

    class TableStructure(BaseTool):
        name: str = "Table structure"
        description: str = "Returns the list of columns and their types"

        def _run(self, table: str) -> str:
            table = table.strip('"').strip("'")
            return get_clickhouse_data(
                'describe {table} format TabSeparatedWithNames'\
                    .format(table = table)
            )

    class TableExamples(BaseTool):
        name: str = "Table examples"
        description: str = "Returns the first N rows from the table"

        def _run(self, table: str, n: int = 30) -> str:
            table = table.strip('"').strip("'")
            return get_clickhouse_data(
                'select * from {table} limit {n} format TabSeparatedWithNames'\
                    .format(table = table, n = n)
            )

    table_structure_tool = TableStructure()
    table_examples_tool = TableExamples()
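    The updated task definition might look roughly like this: a sketch that reuses the earlier fields and only swaps in the new tools.

    # A sketch: the original task fields are kept, only the tools change.
    table_description_task = Task(
        description = '''Provide the comprehensive overview for the data
        in table {table}, so that it's easy to understand the structure
        of the data. This task is crucial to put together the documentation
        for our database''',
        expected_output = '''The comprehensive overview of {table} in the md format.
        Include 2 sections: columns (list of columns with their types)
        and examples (the first 30 rows from table).''',
        tools = [table_structure_tool, table_examples_tool],
        agent = database_specialist_agent
    )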
    With these tools specified in the table description task (as sketched above), we can re-run our script. After the first attempt, I got the following output from the Technical Writer.

    Task output: This final answer provides a detailed and factual description
    of the ecommerce_db.users table structure, including column names, types,
    and descriptions. The documentation adheres to the provided information
    from the database specialist without any additions or modifications.
    More focused tools helped the database specialist retrieve the correct table information. However, even though the writer had all the necessary information, we didn’t get the expected result.

    As we know, LLMs are probabilistic, so I gave it another try. And hooray, this time, the result was pretty good.


    It’s not perfect since it still includes some irrelevant comments and lacks the overall description of the table. However, providing more specialised tools has definitely paid off. It also helped to prevent issues when the agent tried to load all the data from the table.

    Quality assurance specialist
    We’ve achieved pretty good results, but let’s see if we can improve them further. A common practice in multi-agent setups is quality assurance, which adds a final review stage before the results are finalised.


    Let’s create a new agent — a Quality Assurance Specialist, who will be in charge of review.

    qa_specialist_agent = Agent(
        role = "Quality Assurance specialist",
        goal = """Ensure the highest quality of the documentation we provide
        (that it's correct and easy to understand)""",
        backstory = '''You work as a Quality Assurance specialist, checking the work
        from the technical writer and ensuring that it's in line
        with our highest standards.
        You need to check that the technical writer provides full,
        complete answers and makes no assumptions.
        Also, you need to make sure that the documentation addresses
        all the questions and is easy to understand.''',
        allow_delegation = False,
        verbose = True
    )
    Now, it’s time to describe the review task. I’ve used the context parameter to specify that this task requires outputs from both table_description_task and table_documentation_task.

    qa_review_task = Task(
        description = '''
        Review the draft documentation provided by the technical writer.
        Ensure that the documentation fully answers all the questions:
        the purpose of the table and its structure in the form of a table.
        Make sure that the documentation is consistent with the information
        provided by the database specialist.
        Double check that there are no irrelevant comments in the final version
        of the documentation.
        ''',
        expected_output = '''
        The final version of the documentation in markdown format
        that can be published.
        The documentation should fully address all the questions, be consistent
        and follow our professional but informal tone of voice.
        ''',
        tools = [],
        context = [table_description_task, table_documentation_task],
        output_file = "checked_table_documentation.md",
        agent = qa_specialist_agent
    )
    Let’s update our crew and run it.

    full_crew = Crew(
        agents = [database_specialist_agent, tech_writer_agent, qa_specialist_agent],
        tasks = [table_description_task, table_documentation_task, qa_review_task],
        verbose = 2,
        memory = False # doesn't work with Llama
    )

    full_result = full_crew.kickoff({'table': 'ecommerce_db.users'})
    We now have more structured and detailed documentation thanks to the addition of the QA stage.


    Delegation
    With the addition of the QA specialist, it would be interesting to test the delegation mechanism. The QA specialist agent might have questions or requests that it could delegate to other agents.

    I tried using delegation with Llama 3, but it didn’t go well: Llama 3 struggled to call the co-worker tool correctly and couldn’t specify the correct co-worker’s name.

    We achieved pretty good results with a local model that can run on any laptop, but now it’s time to switch gears and use a way more powerful model — GPT-4o.

    To do it, we just need to update the following environment variables.

    os.environ["OPENAI_MODEL_NAME"] = 'gpt-4o'
    os.environ["OPENAI_API_KEY"] = config['OPENAI_API_KEY'] # your OpenAI key
    To switch on delegation, we should specify allow_delegation = True for the QA specialist agent.
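    The updated agent definition could look roughly like this (a sketch with the same fields as before; only the delegation flag changes):

    # A sketch: the QA specialist, now allowed to delegate to co-workers
    qa_specialist_agent = Agent(
        role = "Quality Assurance specialist",
        goal = """Ensure the highest quality of the documentation we provide
        (that it's correct and easy to understand)""",
        backstory = '''You work as a Quality Assurance specialist, checking the work
        from the technical writer and ensuring that it's in line
        with our highest standards.''',
        allow_delegation = True,
        verbose = True
    )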

    Also, we can use the handy memory functionality with OpenAI models, as mentioned above. Memory will allow our agents to share their knowledge with each other during execution and leverage long-term memory to get information from previous executions.

    full_crew = Crew(
        agents = [database_specialist_agent, tech_writer_agent, qa_specialist_agent],
        tasks = [table_description_task, table_documentation_task, qa_review_task],
        verbose = 2,
        memory = True
    )

    full_result = full_crew.kickoff({'table': 'ecommerce_db.users'})
    Let’s see the CrewAI execution logs to understand how delegation works. Here are all the logs for the QA specialist. We can see that it reached out to the database specialist to double-check the information.

    Thought: I need to review the provided draft documentation to ensure it
    fully answers all the questions and is complete, correct, and easy
    to understand. Specifically, I need to ensure that it explains the purpose
    of the table, its structure, and includes relevant examples without
    any irrelevant comments.

    First, I will compare the two drafts to identify any discrepancies and
    missing information. Then, I will verify the accuracy and completeness of
    the information provided.

    **Comparison:**
    1. **Introduction:**
    - First draft: Includes an introduction and examples.
    - Second draft: Adds a more detailed introduction about the purpose and importance of the table.

    2. **Columns:**
    - Both drafts have the same column structure and descriptions.

    3. **Additional Information:**
    - Both drafts include similar additional information, but the second draft is more concise.

    4. **Examples:**
    - The first draft includes examples, which are missing in the second draft.

    5. **Final Checks:**
    - Ensure there are no irrelevant comments.
    - Verify consistency with information provided by the database specialist.

    Action: Ask question to co-worker
    Action Input: {
    "question": "Can you confirm the accuracy of the provided column descriptions
    and examples for the `ecommerce_db.users` table?",
    "context": "We are finalizing the documentation for the `ecommerce_db.users`
    table. The table includes columns `user_id`, `country`, `is_active`, and
    `age`. Here are the examples provided:

    | user_id | country | is_active | age |
    |---------|----------------|-----------|-----|
    | 1000001 | United Kingdom | 0 | 70 |
    | 1000002 | France | 1 | 87 |
    | 1000003 | France | 1 | 88 |
    | 1000004 | Germany | 1 | 25 |
    | 1000005 | Germany | 1 | 48 |
    | 1000006 | France | 1 | 78 |
    | 1000007 | Netherlands | 1 | 65 |
    | 1000008 | Germany | 1 | 31 |
    | 1000009 | Germany | 1 | 66 |
    | 1000010 | Germany | 1 | 73 |",
    "coworker": "database specialist"}

    <-- omitted logs from the database specialist -->

    Based on the provided context and examples for the `ecommerce_db.users` table,
    the column descriptions and examples appear accurate. Here are the details:
    - `user_id`: This column contains a unique identifier for each user.
    The provided examples (e.g., 1000001, 1000002) align with this description
    as they are unique numeric identifiers.

    - `country`: This column specifies the country of the user. The provided
    examples (e.g., United Kingdom, France, Germany, Netherlands) are
    all valid country names.

    - `is_active`: This column indicates whether the user is active (1) or not (0). The provided examples (e.g., 0, 1) correctly represent this binary status.

    - `age`: This column shows the age of the user. The provided examples (e.g., 70,
    87, 88, 25, 48, 78, 65, 31, 66, 73) are all valid ages.

    Therefore, the column descriptions and examples provided for the
    `ecommerce_db.users` table are accurate and meet the expected criteria.

    Thought: Based on the observation from the database specialist,
    the column descriptions and examples provided for the `ecommerce_db.users`
    table are accurate. Now, I will consolidate the best elements from both drafts,
    ensuring the documentation is complete, clear, and free of irrelevant comments.

    Action: Finalize the documentation in markdown format, incorporating
    the detailed introduction, examples, and ensuring the tone is professional
    but informal.
    When I tried the delegation for the first time, I didn’t enable memory, which led to incorrect results. The data specialist and the technical writer initially returned the correct information. However, when the QA specialist returned with the follow-up questions, they started to hallucinate. So, it looks like delegation works better when memory is enabled.

    Here’s the final output from GPT-4o. The result looks pretty nice now. We definitely can use LLMs to automate documentation.


    So, the first task has been solved!

    I used the same script to generate documentation for the ecommerce_db.sessions table as well. It will be handy for our next task. So, let’s not waste any time and move on.

    Use case: answering questions
    Our next task is answering questions based on the documentation, since this is a common task for many data analysts (and other specialists).

    We will start simple and create just two agents:

    The documentation support specialist will answer questions based on the docs,
    The support QA agent will review the answer before sharing it with the customer.

    We will need to empower the documentation specialist with a couple of tools that allow them to see all the files stored in the directory and to read those files. It’s pretty straightforward, since CrewAI has already implemented such tools.

    from crewai_tools import DirectoryReadTool, FileReadTool

    documentation_directory_tool = DirectoryReadTool(
        directory = '~/crewai_project/ecommerce_documentation')

    base_file_read_tool = FileReadTool()
    However, since Llama 3 keeps struggling with quotes when calling tools, I had to create a custom tool on top of FileReadTool to overcome this issue.

    from crewai_tools import BaseTool

    class FileReadToolUPD(BaseTool):
        name: str = "Read a file's content"
        description: str = "A tool that can be used to read a file's content."

        def _run(self, file_path: str) -> str:
            # Strip stray quotes from the path before delegating to the standard tool
            return base_file_read_tool._run(file_path = file_path.strip('"').strip("'"))

    file_read_tool = FileReadToolUPD()
    Next, as we did before, we need to create agents, tasks and crew.

    data_support_agent = Agent(
        role = "Senior Data Support Agent",
        goal = "Be the most helpful support for your colleagues",
        backstory = '''You work as a support for data-related questions
        in the company.
        Even though you're a big expert in our data warehouse, you double check
        all the facts in the documentation.
        Our documentation is absolutely up-to-date, so you can fully rely on it
        when answering questions (you don't need to check the actual data
        in the database).
        Your work is very important for the team's success. However, remember
        that examples of table rows don't show all the possible values.
        You need to ensure that you provide the best possible support: answering
        all the questions, making no assumptions and sharing only the factual data.
        Be creative and try your best to solve the customer's problem.
        ''',
        allow_delegation = False,
        verbose = True
    )

    qa_support_agent = Agent(
        role = "Support Quality Assurance Agent",
        goal = """Ensure the highest quality of the answers we provide
        to the customers""",
        backstory = '''You work as a Quality Assurance specialist, checking the work
        from support agents and ensuring that it's in line with our highest standards.
        You need to check that the agent provides full, complete answers
        and makes no assumptions.
        Also, you need to make sure that the answer addresses all
        the questions and is easy to understand.
        ''',
        allow_delegation = False,
        verbose = True
    )

    draft_data_answer = Task(
        description = '''Very important customer {customer} reached out to you
        with the following question:

        {question}

        Your task is to provide the best answer to all the points in the question
        using all available information and not making any assumptions.
        If you don't have enough information to answer the question, just say
        that you don't know.''',
        expected_output = '''The detailed informative answer to the customer's
        question that addresses all the points mentioned.
        Make sure that the answer is complete and sticks to the facts
        (without any additional information not based on the factual data)''',
        tools = [documentation_directory_tool, file_read_tool],
        agent = data_support_agent
    )

    answer_review = Task(
        description = '''
        Review the draft answer provided by the support agent.
        Ensure that it fully answers all the questions mentioned
        in the initial inquiry.
        Make sure that the answer is consistent and doesn't include any assumptions.
        ''',
        expected_output = '''
        The final version of the answer in markdown format that can be shared
        with the customer.
        The answer should fully address all the questions, be consistent
        and follow our professional but informal tone of voice.
        We are a very chill and friendly company, so don't forget to include
        all the polite phrases.
        ''',
        tools = [],
        agent = qa_support_agent
    )

    qna_crew = Crew(
        agents = [data_support_agent, qa_support_agent],
        tasks = [draft_data_answer, answer_review],
        verbose = 2,
        memory = False # doesn't work with Llama
    )
    Let’s see how it works in practice.

    result = qna_crew.kickoff(
        {'customer': "Max",
         'question': """Hey team, I hope you're doing well. I need to find
         the numbers before our CEO presentation tomorrow, so I will really
         appreciate your help.
         I need to calculate the number of sessions from our Windows users in 2023.
         I've tried to find the table with such data in our data warehouse,
         but wasn't able to.
         Do you have any ideas whether we store the needed data somewhere,
         so that I can query it? """
        }
    )
    We’ve got a polite, practical and helpful answer in return. That’s really great.

    **Hello Max,**

    Thank you for reaching out with your question! I'm happy to help you
    find the number of sessions from Windows users in 2023.
    After reviewing our documentation, I found that we do store data
    related to sessions and users in our ecommerce database, specifically in
    the `ecommerce_db.sessions` table.

    To answer your question, I can provide you with a step-by-step guide
    on how to query this table using SQL. First, you can use the `session_id`
    column along with the `os` column filtering for "Windows" and
    the `action_date` column filtering for dates in 2023.
    Then, you can group the results by `os` using the `GROUP BY` clause
    to count the number of sessions that meet these conditions.

    Here's a sample SQL query that should give you the desired output:

    SELECT COUNT(*)
    FROM ecommerce_db.sessions
    WHERE os = 'Windows'
    AND action_date BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY os;

    This query will return the total number of sessions from Windows
    users in 2023. I hope this helps! If you have any further questions or
    need more assistance, please don't hesitate to ask.
    Let’s complicate the task a bit. Suppose we can get not only questions about our data but also about our tool (ClickHouse). So, we will have another agent in the crew — ClickHouse Guru. To give our CH agent some knowledge, I will share a documentation website with it.

    from crewai_tools import ScrapeWebsiteTool, WebsiteSearchTool

    ch_documenation_tool = ScrapeWebsiteTool(
        'https://clickhouse.com/docs/en/guides/creating-tables')
    If you need to work with a lengthy document, you might try using RAG (Retrieval-Augmented Generation) via WebsiteSearchTool. It will calculate embeddings and store them locally in ChromaDB. In our case, we will stick to a simple website scraper tool.
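    The ClickHouse expert and its task are used in the crew below; here's a minimal sketch of how they could be defined, mirroring the data support agent (the exact wording may differ from the actual definitions):

    # A sketch: a ClickHouse expert agent and its drafting task
    ch_support_agent = Agent(
        role = "ClickHouse Guru",
        goal = "Answer questions about ClickHouse functionality and syntax",
        backstory = '''You are a ClickHouse expert who helps colleagues use
        the database effectively.
        You base your answers on the official documentation and stick to the facts.''',
        allow_delegation = False,
        verbose = True
    )

    draft_ch_answer = Task(
        description = '''Very important customer {customer} reached out to you
        with the following question:

        {question}

        Provide the best possible answer using the ClickHouse documentation.''',
        expected_output = '''A detailed, factually accurate answer
        to the customer's ClickHouse question.''',
        tools = [ch_documenation_tool],
        agent = ch_support_agent
    )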

    Now that we have two subject matter experts, we need to decide who will be working on the questions. So, it’s time to use a hierarchical process and add a manager to orchestrate all the tasks.


    CrewAI provides the manager implementation, so we only need to specify the LLM model. I’ve picked GPT-4o.

    from langchain_openai import ChatOpenAI
    from crewai import Process

    complext_qna_crew = Crew(
        agents = [ch_support_agent, data_support_agent, qa_support_agent],
        tasks = [draft_ch_answer, draft_data_answer, answer_review],
        verbose = 2,
        manager_llm = ChatOpenAI(model = 'gpt-4o', temperature = 0),
        process = Process.hierarchical,
        memory = False
    )
    At this point, I had to switch from Llama 3 to OpenAI models again to run a hierarchical process since it hasn’t worked for me with Llama (similar to this issue).

    Now, we can try our new crew with different types of questions (either related to our data or ClickHouse database).

    ch_result = complext_qna_crew.kickoff(
        {'customer': "Maria",
         'question': """Good morning, team. I'm using ClickHouse to calculate
         the number of customers.
         Could you please remind whether there's an option to add totals
         in ClickHouse?"""
        }
    )

    doc_result = complext_qna_crew.kickoff(
        {'customer': "Max",
         'question': """Hey team, I hope you're doing well. I need to find
         the numbers before our CEO presentation tomorrow, so I will really
         appreciate your help.
         I need to calculate the number of sessions from our Windows users
         in 2023. I've tried to find the table with such data
         in our data warehouse, but wasn't able to.
         Do you have any ideas whether we store the needed data somewhere,
         so that I can query it. """
        }
    )
    If we look at the final answers and logs (I’ve omitted them here since they are quite lengthy, but you can find them, along with the full logs, on GitHub), we will see that the manager was able to orchestrate correctly and delegate tasks to the co-workers with the relevant knowledge to address the customer’s question. For the first (ClickHouse-related) question, we got a detailed answer with examples and the possible implications of using the WITH TOTALS functionality. For the data-related question, the models returned roughly the same information as we’ve seen above.

    So, we’ve built a crew that can answer various types of questions based on the documentation, whether from a local file or a website. I think it’s an excellent result.

    You can find all the code on GitHub.

    Summary
    In this article, we’ve explored using the CrewAI multi-agent framework to create a solution for writing documentation based on tables and answering related questions.

    Given the extensive functionality we’ve utilised, it’s time to summarise the strengths and weaknesses of this framework.

    Overall, I find CrewAI to be an incredibly useful framework for multi-agent systems:

    It’s straightforward, and you can build your first prototype quickly.
    Its flexibility allows you to solve quite sophisticated business problems.
    It encourages good practices like role-playing.
    It provides many handy tools out of the box, such as RAG and a website parser.
    The support of different types of memory enhances the agents’ collaboration.
    Built-in guardrails help prevent agents from getting stuck in repetitive loops.
    However, there are areas that could be improved:

    While the framework is simple and easy to use, it’s not very customisable. For instance, you currently can’t create your own LLM manager to orchestrate the processes.
    Sometimes, it’s quite challenging to get the full detailed information from the documentation. For example, it’s clear that CrewAI implemented some guardrails to prevent repetitive function calls, but the documentation doesn’t fully explain how it works.
    Another improvement area is transparency. I like to understand how frameworks work under the hood. For example, in Langchain, you can use langchain.debug = True to see all the LLM calls. However, I haven’t figured out how to get the same level of detail with CrewAI.
    Full support for local models would be a great addition, as the current implementation either lacks some features or is difficult to get working properly.
    The domain and tools for LLMs are evolving rapidly, so I’m hopeful that we’ll see a lot of progress in the near future.

    Thank you for reading this article. I hope it was insightful. If you have any follow-up questions or comments, please leave them in the comments section.

    Reference
    This article is inspired by the “Multi AI Agent Systems with CrewAI” short course from DeepLearning.AI.