Mistral AI has launched its new AI assistant 'Le Chat', offering web search, vision understanding, ideation, coding, and more for free, and has released 'Pixtral Large', a multimodal model that achieves state-of-the-art performance. Nexusflow announced Athene v2 72B, a model competitive with GPT-4o that performs strongly in chat, code generation, and math. Peking University's PKU-YuanGroup released LLaVA-o1, a model with spontaneous, systematic reasoning ability similar to GPT-o1. A Tsinghua University team proposed SageAttention, a 4/8-bit quantization method that accelerates the Transformer attention mechanism. Fixie AI released Ultravox v0.4.1, an open-source multimodal model for real-time voice processing. Etched developed 'Sohu', an ASIC chip that burns the Transformer architecture directly into silicon, achieving more than a 10x performance gain over GPUs. Apple introduced HyperCloning, a technique that expands a small language model's knowledge into a larger model and shortens training time. Microsoft released one million synthetic instruction pairs and announced TinyTroupe, an LLM-based multi-agent simulation library. OpenDataLab introduced MinerU, a tool that converts PDFs into machine-readable formats. Adaptive ML published an in-depth guide to the PPO algorithm for effectively tuning LLMs. Harbor added support for load testing LLM applications with k6. Meta announced Llama Guard 3 Vision, a safeguard for human-AI image understanding conversations, along with the lightweight Llama Guard 3-1B-INT4 model.

Mistral AI, New AI Assistant 'Le Chat' Launched

Link, November 18, 2024

  • 'Le Chat' launch: Mistral AI released 'Le Chat', a free generative AI work assistant that helps users with creative work, coding, ideation, and more.
  • Integrated web search: 'Le Chat' includes web search with access to up-to-date information, citing sources for its answers to improve reliability.
  • Canvas interface: The new Canvas interface lets users organize ideas visually, edit content in line, and export results in various formats.
  • Image generation: High-quality image generation is provided through Black Forest Labs' Flux Pro.
  • Faster responses: Response speed was improved with speculative editing, improving the overall user experience.
  • Competitive positioning: Mistral's comparison table positions 'Le Chat' favorably against ChatGPT, Perplexity, and Claude on web search, the Canvas feature, and multimodal understanding.

Mistral AI, Multimodal Model 'Pixtral Large' Released

Link, November 18, 2024

  • Model overview: 'Pixtral Large' is a 124B-parameter multimodal model built on top of Mistral Large 2.
  • State-of-the-art performance: It achieves top results on multimodal benchmarks such as MathVista, DocVQA, and VQAv2.
  • Architecture: It consists of a 123B-parameter multimodal decoder and a 1B-parameter vision encoder.
  • Extended context window: A 128K context window fits a minimum of 30 high-resolution images at once.
  • Text performance preserved: It handles multimodal input while maintaining the text-only performance of Mistral Large 2.
  • License: Released under the Mistral Research License (MRL) for research and educational use, with a separate commercial license available for commercial use.
  • API and download: The model is available through the API as 'pixtral-large-latest' and can also be downloaded for self-deployment; a minimal API sketch follows below.
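
The following is a minimal sketch of calling the model through the Mistral chat API using the mistralai Python SDK (v1); the image URL and the MISTRAL_API_KEY environment variable are placeholders, and the method names reflect the v1 client as an assumption:

import os
from mistralai import Mistral

# Assumes the mistralai v1 SDK; MISTRAL_API_KEY is a placeholder environment variable.
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="pixtral-large-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the chart in this image."},
            {"type": "image_url", "image_url": "https://example.com/chart.png"},
        ],
    }],
)
print(response.choices[0].message.content)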

Nexusflow, 'Athene v2 72B' Model Released

Link, November 14, 2024

  • Model overview: 'Athene v2 72B' is a large language model fine-tuned from Qwen 2.5 72B, with performance competitive with GPT-4o.
  • Benchmarks: It scores 77.9 on Arena-Hard, close to GPT-4o's 84.9, and 31.4 on BigCodeBench-Hard, ahead of GPT-4o's 30.8.
  • Math: It reaches 83 on the MATH benchmark, above GPT-4o's 76.6.
  • RLHF: Specific capabilities were strengthened through reinforcement learning from human feedback (RLHF) and a carefully designed data pipeline.
  • Deep customization: It supports deep customization for building agents and for use in production environments.
  • Open release: The model is published on Hugging Face and works with the Transformers library; a usage sketch follows below.
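
A short sketch of loading the chat model with Transformers; the Hugging Face repository id below is an assumption (check the Nexusflow organization page for the exact name), and device_map="auto" requires the accelerate package:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Nexusflow/Athene-V2-Chat"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is the integral of x^2 from 0 to 3?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))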

PKU-YuanGroup, 'LLaVA-o1' Model Released

Link, November 20, 2024

  • Model highlights: 'LLaVA-o1' is an 11B-parameter vision-language model capable of spontaneous, systematic reasoning similar to GPT-o1.
  • Benchmark comparison: It outperforms Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six multimodal benchmarks.
  • Multimodal processing: It processes images and text together to handle complex multimodal tasks.
  • Weights released: Pretrained weights are available on Hugging Face for researchers to use freely; a loading sketch follows below.
  • Code and app coming: The code and a Gradio application will be released soon so users can try the model themselves.
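
A rough sketch of loading the released checkpoint (a Llama-3.2-Vision fine-tune) with Transformers; it assumes the weights load with the stock Llama 3.2 Vision classes, and the image path and prompt are placeholders:

from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "Xkev/Llama-3.2V-11B-cot"  # pretrained weights released by the authors
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("figure.png")  # placeholder image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "How many objects are in the picture? Reason step by step."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))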

Tsinghua University, 'SageAttention' Proposed to Accelerate Transformers

Link, November 20, 2024

  • Quantization method: 'SageAttention' is a 4/8-bit quantization method for accelerating the attention mechanism in Transformers.
  • API compatibility: It provides a drop-in replacement API for PyTorch's Scaled Dot-Product Attention (SDPA, i.e., Flash Attention); a usage sketch follows below.
  • Speedup: It achieves up to a 3x speedup over FlashAttention-2 while maintaining about 99% of the original model's performance.
  • Quantization scheme: INT4/8 quantization is applied to the Q and K matrices and FP8/16 to P and V, together with smoothing of Q and V, to improve compute efficiency.
  • Dynamic quantization: Quantization levels are adjusted dynamically across timesteps and layers to optimize performance.
  • Code availability: The code for 'SageAttention 1' is already public, and the code for 'SageAttention 2' will be released soon.
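
A minimal sketch of the drop-in replacement pattern, assuming the released sageattention package exposes a sageattn function with an SDPA-like signature (exact keyword arguments may differ by version):

import torch
import torch.nn.functional as F
from sageattention import sageattn  # assumed import path for the released package

# Toy attention inputs: (batch, heads, seq_len, head_dim) in fp16 on the GPU.
q = torch.randn(1, 16, 1024, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 16, 1024, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 16, 1024, 64, dtype=torch.float16, device="cuda")

out_ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # baseline SDPA
out = sageattn(q, k, v, is_causal=True)                            # quantized drop-in replacement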

Fixie AI, Real-Time Voice Model 'Ultravox v0.4.1' Released

Link, November 2024

  • Model overview: 'Ultravox v0.4.1' is a multimodal LLM that processes speech input directly and generates text.
  • Architecture: It pairs a Whisper encoder with an LLM decoder such as Llama 3.1, handling speech directly without a separate speech recognition stage; a loading sketch follows below.
  • Performance: The Llama 3.1 70B version is competitive with OpenAI GPT-4o on the CoVoST-2 benchmark.
  • Real-time capability: With a Llama 3.1 8B backbone on an A100-40GB GPU, it generates roughly 50-100 tokens per second with a time-to-first-token of about 150ms.
  • Multilingual support: It supports 15 languages with the Llama 3.1 backbones, and speech-to-speech use cases are planned alongside speech-to-text.
  • Open source and license: The models and code are released on Hugging Face under the MIT license and are free to use.
  • Demo and API: A real-time demo is available, and a managed API supports voices, tool calling, telephony, and more.
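
A sketch of loading the model with the Transformers pipeline API, following the pattern published on the Ultravox Hugging Face pages; the repository id, the input dictionary keys, and the use of librosa for audio loading are assumptions:

import transformers
import librosa  # any loader that returns a float waveform plus its sampling rate works

# The repo id is an assumed checkpoint name; check the Ultravox Hugging Face page for the exact one.
pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_4_1-llama-3_1-8b",
    trust_remote_code=True,
)

audio, sr = librosa.load("question.wav", sr=16000)
turns = [{"role": "system", "content": "You are a helpful voice assistant."}]
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=64))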

Etched, Transformer-Only ASIC Chip 'Sohu' Developed

Link, June 25, 2024

  • ASIC chip 'Sohu': Etched announced 'Sohu', an ASIC that burns the Transformer architecture directly into silicon.
  • Performance: It claims more than 10x the speed and cost-efficiency of GPUs, with an 8xSohu server serving over 500,000 Llama 70B tokens per second.
  • Specialized architecture: 'Sohu' is specialized for Transformers; it cannot run CNNs, RNNs, or other architectures, but it supports all current Transformer-based models.
  • Hardware specs: With 144GB of memory per chip, it scales to models with up to 100 trillion parameters.
  • Advanced features built in: Speculative decoding, beam search, Monte Carlo Tree Search (MCTS), and Mixture of Experts (MoE) are supported at the hardware level.
  • Manufacturing and partnerships: It is built on TSMC's 4nm process, with supply chains and partnerships in place for volume production.
  • Open-source software: A fully open-source software stack lets developers use and customize it freely.

Apple, 'HyperCloning' Technique for Initializing Large Models from Small Models Announced

Link, September 20, 2024

  • 'HyperCloning' introduced: A method for initializing a large language model by expanding the parameters of a smaller pretrained model.
  • Shorter training: It accelerates large-model pre-training by roughly 2-4x, saving time and cost.
  • Knowledge transfer: The smaller model's knowledge and capabilities carry over to the larger model, giving it high accuracy from the start of training.
  • Technical details: The parameters of the linear layers are expanded to larger hidden dimensions so that the larger model reproduces the smaller model's function at initialization; a toy illustration follows below.
  • Supported models: It applies to a range of architectures, including OPT, Pythia, OLMo, Gemma, and Llama.
  • Code and library: The code is publicly available, so researchers can easily apply and experiment with it.
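
The following toy example sketches one simple function-preserving expansion of a linear layer (block-tiling the weight and scaling by 1/n, with the hidden state duplicated); it illustrates the general idea rather than the exact mapping used by HyperCloning:

import torch

def expand_linear(weight: torch.Tensor, n: int = 2) -> torch.Tensor:
    # Tile a (out, in) weight into an (n*out, n*in) block matrix scaled by 1/n so that,
    # when the input hidden state is duplicated n times, each output block equals W @ x.
    return torch.cat([torch.cat([weight] * n, dim=1) / n] * n, dim=0)

w_small = torch.randn(4, 3)
x_small = torch.randn(3)

w_big = expand_linear(w_small, n=2)
x_big = torch.cat([x_small, x_small])  # duplicated hidden state of the larger model

# The expanded layer reproduces the small layer's output (duplicated), i.e. the expansion is function-preserving.
assert torch.allclose(w_big @ x_big, torch.cat([w_small @ x_small] * 2), atol=1e-6)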

Microsoft, One Million Synthetic Instruction Pairs Released

Link, November 15, 2024

  • Dataset overview: Microsoft released roughly one million synthetic instruction pairs generated with the AgentInstruct framework.
  • Fully synthetic: Both prompts and responses are synthetically generated, using only raw text publicly available on the web as seeds.
  • Broad coverage: It spans tasks such as text editing, creative writing, coding, and reading comprehension, and can be used for instruction tuning.
  • Reported gains: Post-training Mistral-7B on the full AgentInstruct data (a roughly 25M-pair superset of this release) yielded improvements over Mistral-7B-Instruct of 40% on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval.
  • Use and license: The dataset is shared for research purposes and is accessible on Hugging Face; a loading sketch follows below. Microsoft notes it is not intended for use in educational or health systems.
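
A minimal sketch of pulling the dataset with the Hugging Face datasets library; the split names and record schema are not assumed here, so the code simply inspects whatever the Hub returns:

from datasets import load_dataset

# Load the dataset and inspect its splits; split names and record layout are discovered at runtime.
ds = load_dataset("microsoft/orca-agentinstruct-1M-v1")
print(ds)                      # shows the available splits and their sizes
first_split = next(iter(ds))   # pick whichever split comes first
print(ds[first_split][0])      # inspect one instruction pair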

Microsoft, LLM-Based Multi-Agent Simulation 'TinyTroupe' Released

Link, November 15, 2024

  • 'TinyTroupe' introduced: An LLM-powered multi-agent simulation library for creating artificial agents with distinct personalities, interests, and goals.
  • Simulation capabilities: Agents interact with one another to simulate scenarios such as advertisement evaluation, software testing, and product planning.
  • Use cases: It can serve as a tool for business insights, e.g. predicting ad effectiveness, brainstorming product ideas, and simulating customer interviews.
  • Developer-friendly: Written in Python, it is easy to customize and extend; a usage sketch follows below.
  • Open source and community: It is released at an early stage, and the authors are seeking feedback and contributions to guide its development.
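
A rough sketch of a two-agent simulation based on the patterns in the project's examples; the factory, class, and method names below are taken from those examples and are assumptions that may change, since the authors note the API is still unstable:

from tinytroupe.environment import TinyWorld
from tinytroupe.examples import create_lisa_the_data_scientist, create_oscar_the_architect

# Two predefined personas shipped with the library's examples (names assumed from the repo).
lisa = create_lisa_the_data_scientist()
oscar = create_oscar_the_architect()

world = TinyWorld("Focus group", [lisa, oscar])
world.make_everyone_accessible()  # let the agents address each other

lisa.listen("Let's brainstorm new AI features for a word processor.")
world.run(4)                      # advance the simulation a few steps and print the interactions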

OpenDataLab, PDF Conversion Tool 'MinerU' Released

Link

  • Tool overview: 'MinerU' is an open-source tool that converts PDF documents into machine-readable formats.
  • Clean text extraction: It removes headers, footers, page numbers, and footnotes to extract meaningful text while preserving reading order.
  • Structure preservation: Titles, paragraphs, lists, and other elements of the original document structure are retained.
  • Images and formulas: It extracts images, image captions, tables, table captions, and footnotes, and converts formulas to LaTeX.
  • Table conversion: Tables are converted to LaTeX or HTML to make them easier to reuse.
  • OCR support: OCR handles scanned or damaged PDFs, with support for 84 languages.
  • Output formats: Multiple output formats, including Markdown and JSON, are supported.
  • Visualization: Layout and span visualizations make it easy to check output quality.
  • Platform support: It runs on both CPU and GPU, on Windows, Linux, and macOS.
  • Gradio app: A Gradio-based application interface is provided so users can try the tool directly; a CLI sketch follows below.
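
A rough sketch of invoking MinerU from Python through its command-line entry point (distributed as the magic-pdf package); the flags follow the project's quick-start and are assumptions that may differ by version:

import subprocess

# -p: input PDF, -o: output directory, -m: parsing method (e.g. auto or ocr); flags assumed from the quick-start.
subprocess.run(
    ["magic-pdf", "-p", "paper.pdf", "-o", "./output", "-m", "auto"],
    check=True,
)
# The output directory should then contain the Markdown/JSON renditions of the PDF.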

Adaptive ML, PPO Guide for Tuning LLMs Published

Link, November 19, 2024

  • Guide overview: An in-depth guide to Proximal Policy Optimization (PPO) for improving LLM helpfulness.
  • Learning path: It walks step by step from supervised fine-tuning (SFT) through REINFORCE and Advantage Actor-Critic (A2C) to PPO.
  • RLHF: It covers how reinforcement learning from human feedback (RLHF) improves an LLM's helpfulness, harmlessness, and honesty.
  • How PPO works: It explains the mechanics of PPO and the interaction between the policy model (actor), the value-function model (critic), and the reward model; a sketch of the clipped objective follows below.
  • Practical application: It shows how to apply PPO to tune an LLM in practice and what to watch out for.
  • Further considerations: It discusses practical issues such as preventing overfitting, the limitations of reward models, and constraining policy drift.
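
For reference, a generic sketch of the clipped surrogate objective at the heart of PPO, applied per generated token; this is a textbook formulation, not Adaptive ML's exact implementation:

import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    # Probability ratio between the current policy and the policy that generated the samples.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two surrogates; return the negative for a minimizer.
    return -torch.minimum(unclipped, clipped).mean()

# Toy usage with per-token log-probabilities and advantages:
loss = ppo_clipped_loss(torch.randn(8), torch.randn(8), torch.randn(8))
print(loss.item())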

Harbor, k6-Based Load Testing for LLM Applications Added

Link, November 19, 2024

  • Load-testing integration: Harbor integrates the k6 load-testing tool with an OpenAI-compatible client to support load testing of LLM applications.
  • Performance monitoring: Preconfigured InfluxDB and Grafana dashboards allow real-time monitoring of the application under test.
  • OpenAI compatibility: Built-in helpers for OpenAI-compatible API clients and custom datasets make test scripts easy to write.
  • Compatible backends: It works with Hugging Face Text Generation Inference (TGI), vLLM, Ollama, and others, so the performance of different LLM serving stacks can be compared and tested.
  • Example scripts: Example scripts and endpoints are provided so users can get started quickly.
  • Self-contained setup: The environment can be set up independently with Docker Compose, and the code and documentation are available on GitHub.

Meta, Multimodal Safeguard 'Llama Guard 3 Vision' Announced

Link, November 20, 2024

  • Safeguard overview: 'Llama Guard 3 Vision' is an LLM-based safeguard for keeping multimodal conversations safe.
  • Image-understanding conversations: It processes both images and text to detect and block harmful content in multimodal inputs.
  • Prompt and response classification: It classifies both multimodal LLM inputs (prompts) and outputs (responses) for harmfulness.
  • Benchmark results: Using the MLCommons hazard taxonomy, it achieves strong results on internal benchmarks.
  • Robustness testing: Its robustness against adversarial attacks was tested to confirm the model's stability and reliability.
  • Contribution to multimodal safety research: It lays the groundwork for safe human-AI conversations in multimodal AI systems.

Meta, Lightweight Safeguard Model 'Llama Guard 3-1B-INT4' Released

Link, November 20, 2024

  • Model overview: 'Llama Guard 3-1B-INT4' is a lightweight 1B-parameter safeguard model compressed with INT4 quantization.
  • Performance and efficiency: On an Android mobile CPU it achieves more than 30 tokens per second with a time-to-first-token under 2.5 seconds, so it runs efficiently even in resource-constrained environments.
  • Safety maintained: Despite being roughly 7x smaller (about 440MB), it matches or exceeds the safety-classification performance of Llama Guard 3-1B.
  • Easy deployment: Its small size and low compute requirements make it easy to deploy on mobile devices, embedded systems, and other platforms.
  • Open release: The model is released openly so researchers and developers can use it to build safer AI conversation systems.

Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each item with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary with good detail; in the Summary section, explain the details starting with the company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://mistral.ai/news/mistral-chat/
Mistral has entered the chat
Search, vision, ideation, coding… all yours for free.

November 18, 2024 Mistral AI team
AI is as much a pursuit of seamless interaction between human and machine intelligence, as it is of frontier science that makes machines reason better. At Mistral AI, we began our journey by developing state of the art models; we have now begun to invest in AI interfaces, starting with some big updates to le Chat.

What’s New
If you’re new to le Chat, welcome! Le Chat [/lə ʃa/, French: ‘the cat’] is the free generative AI work assistant from Mistral AI. You can use le Chat as your creative partner, teach it to perform specific tasks, use it as a coding copilot… basically, use it to get stuff done.

Today, we’re giving le Chat new powers (in beta). Here’s a quick summary:

Web search with citations
Canvas for ideation, in-line editing, and export
State of the art document and image understanding, powered by the new multimodal Pixtral Large
Image generation, powered by Black Forest Labs Flux Pro
Fully integrated offering, from models to outputs
Faster responses powered by speculative editing
Here’s how le Chat compares to leading chat and search assistants in the market today:

Feature | ChatGPT | Perplexity | Claude | Le Chat
Access to latest frontier model | Paid | Yes | Yes | Yes
Web search with citations | Yes | Yes | No | Yes
Canvas for ideation, in-line editing, and export | Paid | No | Limited | Yes
Image generation | Yes | No | No | Yes
Image / document analysis | Yes | Paid | Yes | Yes
Task agents | Paid | No | No | Yes
Single platform from models to outputs | Yes | No | Yes | Yes
Let’s dive in.

Web search
A significant share of le Chat usage comes from students and professionals who use it for learning, research, and work. To help better facilitate these use cases, we’ve augmented le Chat’s training knowledge with internet search.

Here’s one of our marketing teammates using it to assess the AI market for healthcare.


With state of the art language models that have access to the latest knowledge and happenings, you have a powerful tool to level up your work.

Canvas
Canvas is a new interface that pops up in your le Chat window when you need to go beyond conversations and into ideation. It enables you to collaborate with Mistral LLMs on shared outputs, enabling 0-1 creation and ideation for knowledge work.

No longer restricted to bidirectional conversations, Canvas enables you to directly modify, edit or transform content aided by the powerful reasoning capabilities of LLMs.

Here’s a video of our product team creating a react mockup of a product metrics dashboard in preparation for the launch of Canvas.


You can use Canvas to create documents, presentations, code, mockups… the list goes on. As you can see, you’re able to modify its contents in place without regenerating responses, version your drafts, and preview your designs.

Canvas allows for infinite creative ideation for any team.

Fast and accurate document understanding
Le Chat is now able to process large, complex PDF documents and images, for analysis and summarization. Simply attach documents to your chat bar and get insights on anything—graphs, tables, diagrams, text, formulae, equations… basically, the entirety of the document.

These advanced capabilities are made possible by the new Pixtral Large, a state of the art multimodal model that is significantly more powerful than our previous vision model, Pixtral 12B, in conjunction with an experimental model (more on that in a few weeks).

In the example below, we show the extraction of information, summarization, and semantic understanding of the famous quantum entanglement paper from 1935 authored by Einstein, Podolsky, and Rosen.


Image generation
We have partnered with our friends at Black Forest Labs to bring their leading image generation models to le Chat. You can now generate high quality images for your work right from within le Chat.

Le Chat Agents to automate repetitive workflows
If you find yourself reusing prompts, or if you’ve identified certain patterns of creative work that you prefer to be automated, you can codify these workflows as agents. Popular use cases include receipt scanning for expense reporting, creating TL;DR’s of long meeting minutes, and invoice processing.

Agents can be published and shared with your teammates. You can ideate recursively, and build increasingly powerful agents that help you get your work done faster. For more on agents, check out this blog or our documentation.

How much does all of this cost?
Currently, all of these features are introduced as free beta offerings. At Mistral AI, our approach to AI is different – we’re not chasing AGI at all costs; our mission is to instead place frontier AI in your hands, so you get to decide what to do with advanced AI capabilities. This approach has allowed us to be quite frugal with our capital, while consistently delivering frontier capabilities at affordable price points. On le Chat, we’re offering a generous free tier with these beta features and working towards premium tiers with higher service guarantees.

Try it
Head over to chat.mistral.ai to try le Chat. If your current AI-productivity toolchain is siloed or expensive, le Chat will give you an alternative to do your best work. All new features will be rolled out gradually over the next few weeks—be sure to let us know what you think!
You know us for our models, and starting today, we hope you will also enjoy our AI assistant! We're proud to introduce the next generation of le Chat with search, canvas, document analysis, coding... all in one place. We're excited for you to try it at chat.mistral.ai.

Great day for LLMs, Mistral AI just released Mistral & Pixtral Large - ~123B, 128K context, Multilingual, JSON + Function calling support & open weights! 🔥
Pixtral Large:
> SoTA on MathVista, DocVQA, VQAv2
> Extends Mistral Large 2 123B multimodal decoder, 1B parameter vision encoder
> 128K context window: fits minimum of 30 high-resolution images
Mistral Large:
> Dozens of languages supported, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch and Polish
> Trained on 80+ coding languages such as Python, Java, C, C++, Javascript, and Bash
> SoTA Agent + MATH capabilities
> 128k context window + maintains strong adherence and support for more reliable system prompts
Transformers integration on its way!
Congrats Mistral AI on the release! 🤗

###
https://mistral.ai/news/pixtral-large/
Pixtral Large
Pixtral grows up.

November 18, 2024 Mistral AI team
Pixtral Large in short:
Frontier-class multimodal performance
State-of-the-art on MathVista, DocVQA, VQAv2
Extends Mistral Large 2 without compromising text performance
123B multimodal decoder, 1B parameter vision encoder
128K context window: fits minimum of 30 high-resolution images
Use:
Try it on le Chat
In our API as pixtral-large-latest
Download it here
Today we announce Pixtral Large, a 124B open-weights multimodal model built on top of Mistral Large 2. Pixtral Large is the second model in our multimodal family and demonstrates frontier-level image understanding. Particularly, the model is able to understand documents, charts and natural images, while maintaining the leading text-only understanding of Mistral Large 2.

The model is available under the Mistral Research License (MRL) for research and educational use; and the Mistral Commercial License for experimentation, testing, and production for commercial purposes.

Benchmark summary
Performance
We evaluate Pixtral Large against frontier models on a set of standard multimodal benchmarks, through a common testing harness. On MathVista, which evaluates complex mathematical reasoning over visual data, our model achieves 69.4%, outperforming all other models. To assess reasoning capabilities over complex charts and documents, we evaluate performance using ChartQA and DocVQA, where Pixtral Large surpasses GPT-4o and Gemini-1.5 Pro.

Finally, Pixtral Large demonstrates competitive capabilities on MM-MT-Bench, outperforming all of Claude-3.5 Sonnet (new), Gemini-1.5 Pro and GPT-4o (latest). MM-MT-Bench is an open-source, judge-based evaluation intended to reflect real-world use cases of multimodal LLMs (see the Pixtral 12B technical report for details).

Detailed benchmarks
Qualitative Samples
Multilingual OCR and reasoning…
German receipt
Prompt:

I bought the coffee and the sausage. How much do I owe? Add a 18% tip.

Response:

To calculate the total amount you owe, including the 18% tip, follow these steps:

Calculate the cost of the coffee and the sausage:

2x Latte Macchiato: 2 * 4.50 CHF = 9.00 CHF
1x Schweinschnitzel: 22.00 CHF
Total cost for coffee and sausage: 9.00 CHF + 22.00 CHF = 31.00 CHF
Calculate the 18% tip:

Tip: 31.00 CHF * 0.18 = 5.58 CHF
Add the tip to the total cost:

Total amount owed: 31.00 CHF + 5.58 CHF = 36.58 CHF
So, you owe a total of 36.58 CHF.

Chart understanding…
Loss curve
Prompt:

When did things start to go wrong for dark dragon?

Response:

The training loss for the “dark-dragon-50” model starts to exhibit significant instability around the 10,000 steps mark. There is a noticeable spike in the loss value at this point, and the instability continues with another major spike around the 20,000 steps mark. These spikes indicate that the training process for the “dark-dragon-50” model began to encounter issues starting around the 10,000 steps mark.

And more…
Website screenshot
Prompt:

Which companies use models made by Mistral AI?

Response:

The companies that use models made by Mistral AI, as shown in the image, are:

BNP Paribas
Brave
Cloudflare
CMA CGM
Front
One more thing…
Along with Pixtral Large, Mistral Large, our state-of-the-art text model, also gets an update. The model is available as pixtral-large-latest on our API, as well as for self-deployment as Mistral Large 24.11 on HuggingFace under the Mistral Research License (MRL) for research, or with a commercial license from Mistral AI for commercial use.

This newest model provides a significant upgrade on the previous Mistral Large 24.07, with notable improvements in long context understanding, a new system prompt, and more accurate function calling. The model is highly capable for RAG and agentic workflows, making it a suitable choice for enterprise use cases such as knowledge exploration and sharing, semantic understanding of documents, task automation, and improved customer experiences.

Mistral Large 24.11 will be available from our cloud provider partners soon, starting with Google Cloud and Microsoft Azure within a week.


###
https://nexusflow.ai/blogs/athene-v2
Nexusflow released Athene v2 72B - competitive with GPT-4o & Llama 3.1 405B on Chat, Code and Math 🔥
> Arena Hard: GPT4o (84.9) vs Athene v2 (77.9) vs L3.1 405B (69.3)
> Bigcode-Bench Hard: GPT4o (30.8) vs Athene v2 (31.4) vs L3.1 405B (26.4)
> MATH: GPT4o (76.6) vs Athene v2 (83) vs L3.1 405B (73.8)
> Models are on the Hub and work out of the box w/ Transformers 🤗

Open science ftw! ⚡
By Nexusflow | Nov 14, 2024

We’re thrilled to announce Athene-V2, our latest 72B model suite. Fine-tuned from Qwen 2.5 72B, Athene-V2 competes with GPT-4o across key capabilities, powered by a meticulously designed data and RLHF pipeline. As the industry recognizes the slow-down of scaling law—where increasing model size alone no longer delivers universal capability improvements—there’s a growing need for specialized customization to enhance specific capabilities. Our post-training process illustrates this shift, demonstrating how our data and tuning solutions allow us to finely optimize for distinct skills and use cases.

Here’s a look at the unique specializations that position Athene-V2 models along the Pareto frontier of LLM post-training:

Athene-V2-Chat-72B: A state-of-the-art chat model, matching GPT-4o across multiple benchmarks. It outperforms GPT-4o in chat helpfulness (Arena-Hard), excels in code completion (ranking #2 on bigcode-bench-hard), mathematics (MATH), and handles long log extraction with higher precision (our internal benchmark).
Athene-V2-Agent-72B: Striking a balance between chat and agent capabilities, this model offers concise, directive chat responses, surpassing GPT-4o in our latest Nexus-V2 function calling benchmarks that focus on hard enterprise-level function calling use cases.

Table 1. Benchmark results for Athene-V2 model suite and popular open and proprietary models.
The Pareto Frontier of LLM Post-training
In LLM post-training, it's common to expect universal improvements across all tracked metrics as the model is trained on more high-quality data. However, our observations reveal that as models approach a certain "Pareto frontier"—the point where a balance between multiple performance metrics is optimized—achieving further improvement requires a strategic shift. Beyond this point, the most effective way to realize substantial gains is by refining specific capabilities, trading off certain aspects for focused enhancements along the frontier. This approach enables us to achieve targeted, meaningful improvements rather than universal changes across all metrics.

A good example of such a trade-off is the progression of GPT-4. Initially, some users perceived GPT-4-0613 as a regression from GPT-4-0314, despite improvements made by OpenAI based on user feedback—showcasing the trade-off dynamic at play. This research paper is studying and tracking these changes. Similarly, we’re seeing selective customization efforts, such as Harvey’s collaboration with OpenAI to tailor models for legal applications, among other domains.

We also observe a similar trend in our post-training processes. As is shown in Figure 2, the quality of post-training data and RLHF strategies defines a hidden Pareto frontier that governs the balance between chat and agent capabilities. Customization using the state-of-the-art post-training pipelines allows movement along this frontier, rather than beyond it. For example, the Athene-V2-Agent model emphasizes agent-oriented capabilities, sacrificing slightly general chat flexibility, while the Athene-V2-Chat model excels in dialogue yet shows some limitations on agent-related tasks.


Figure 2. The post-training pipeline of Athene-V2 pushes Qwen 2.5 into a better pareto frontier for chat and agent capabilities, matching GPT-4o across benchmarks.
Building AI Agents in Production Requires Deeper Customization
Deploying production-ready agents demands deeper customization than what standard benchmarks can measure. Traditional benchmarks often fall short because they can't fully capture model performance within complex, real-world systems. Instead, actionable insights emerge from analyzing system execution results holistically. For instance, examining precision-recall tradeoffs highlights how customization enhances model effectiveness.

Excelling in these challenging metrics requires even deeper customization than the current approaches used in our Athene V2 pipeline. Nexusflow provides the expertise and tools to further optimize agents for real-world complexities, unlocking full potential in production settings.

As a concrete example, consider a ticket management and search system with 200 filter options including customer name, category, location, time, priority, access, urgency, and status. Feeding all options to a model may cause over-filtering and empty search results. As shown in Figure 3, a request for "urgent customer threads in SF which require manager approval" might trigger both urgency and priority filters incorrectly, despite being prompted to treat them differently.


Figure 3. Example of queries for a ticket management system
This over-filtering behavior requires balancing the model's precision and recall – reducing unrelated filter triggers while focusing on relevant ones. As is shown in Figure 4, fine-tuning on carefully curated datasets to balance precision and recall can move the model to a regime with higher overall accuracy, and usually achieves better results than pure prompt tuning. The results are directly reported in our search-FC benchmark numbers in Table 5, which show a significant gap between our specially customized Athene model and other models. Interestingly, by further tuning on more function calling data, we also observe higher gains in agent performance.


Figure 4. Deep customization helps move the starting model to a regime with better precision-recall tradeoff.

Table 5. Deep customization for precision-recall tradeoff improves greatly for the search system we discussed above.
Based on our engagements with enterprise customers, we believe that deep customization and continuous learning from user interactions are essential to improving agent model quality. We provide tuning recipes and pipelines to our enterprise partners, enabling them to build robust agent models tailored to their specific systems with continuous quality improvement. Contact our team for a demo!

###
https://github.com/PKU-YuanGroup/LLaVA-o1
11/20/24
LLaVA-o1: Let Vision Language Models Reason Step-by-Step

LLaVA-o1 is the first visual language model capable of spontaneous, systematic reasoning, similar to GPT-o1!
🤯 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six multimodal benchmarks.
LLaVA-o1
Stay tuned for the code and gradio app release.

If you like our project, please give us a star ⭐ on GitHub for the latest update.

📣 News
[2024/11/20] We've released the pretrained weights: https://huggingface.co/Xkev/Llama-3.2V-11B-cot
[2024/11/18] We've released our paper: https://arxiv.org/abs/2411.10440
[2024/11/18] Welcome to watch 👀 this repository for the latest updates.
🔥 Highlights
LLaVA-o1 is the first visual language model capable of spontaneous, systematic reasoning, similar to GPT-o1!

Our 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks.

###
https://github.com/thu-ml/SageAttention/tree/main
11/20/24
Tsinghua Statistical Artificial Intelligence & Learning Group

Is SageAttention the next Flash Attention? SageAttention is a 4/8-bit quantization method designed to accelerate the attention mechanism in transformers, with a drop-in replacement API for PyTorch SDPA (Flash Attention)! 👀
> 3x speed up over Flash Attention2 while maintaining 99% performance
> INT4/8 for Q and K matrices, and FP8/16 for P and V + smoothing methods for Q and V matrices
> Dynamically adjusts quantization across timesteps and layers
> Drop-in replacement of torch scaled_dot_product_attention
> SageAttention 1 code released, 2 to be released soon


###
https://github.com/fixie-ai/ultravox/
A fast multimodal LLM for real-time voice
Ultravox

2024/11 — Ultravox 0.4.1 available
2024/08 — Ultravox 0.4 available
2024/08 — Ultravox 0.3 available
2024/08 — Preview of Ultravox APIs available, more information here
About
Ultravox is a new kind of multimodal LLM that can understand text as well as human speech, without the need for a separate Automatic Speech Recognition (ASR) stage. Building on research like AudioLM, SeamlessM4T, Gazelle, SpeechGPT, and others, Ultravox is able to extend any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional space used by the LLM. We've trained versions on Llama 3, Mistral, and Gemma. This direct coupling allows Ultravox to respond much more quickly than systems that combine separate ASR and LLM components. In the future this will also allow Ultravox to natively understand the paralinguistic cues of timing and emotion that are omnipresent in human speech.

The current version of Ultravox (v0.4), when invoked with audio content, has a time-to-first-token (TTFT) of approximately 150ms, and a tokens-per-second rate of ~60 using a Llama 3.1 8B backbone. While quite fast, we believe there is considerable room for improvement in these numbers.

Ultravox currently takes in audio and emits streaming text. As we evolve the model, we'll train it to be able to emit a stream of speech tokens that can then be converted directly into raw audio by an appropriate unit vocoder.

Demo
See Ultravox in action on our demo page.

Discord
Join us on our Discord server here.

Jobs
If you're interested in working on Ultravox fulltime, we're hiring! Check out our jobs page here.

Inference Server
You can try out Ultravox using your own audio content (as a WAV file) by spinning up an Ultravox instance on our partner, BaseTen: https://www.baseten.co/library/ultravox/. They offer free credits to get started.

If you're interested in running Ultravox in a real-time capacity, we offer a set of managed APIs as well. You can learn more about getting access to those here.

Model
You can download the latest weights from the Ultravox Hugging Face page.
Ready for real-time conversations with AI? A new open multimodal model, Ultravox v0.4.1, just got released. Ultravox uses a Whisper encoder and an LLM such as Meta Llama 3.1 as the decoder. Currently, it can receive audio and generate text! The Llama 3.1 70B version is competitive with OpenAI GPT-4o on CoVoST-2! 👀
TL;DR:
🧠 Uses Whisper as an Audio Encoder and LLM (Mistral, Llama) as a Generator
💪 Llama 3.1 70B version claims to be competitive with GPT-4o
🌐 Support for 15 languages for Llama 3.1 backbones
📚 Offer Managed service with built-in support for voices, tool calling, telephony, and more.
⚡Time-to-first-token (TTFT) of ~150ms and tokens-per-second rate of ~50-100 on A100-40 GB GPU with Llama 3.1 8B backbone
🎧 Combined with audio generator for speech-to-speech use cases, e.g. see demo
💰 Managed service at only $0.05/min (30 mins free trial)
🛠️ Trainable on any open-weight LLM with code available on GitHub
🤗 Released on Hugging Face with MIT license
Realtime demo:
https://demo.ultravox.ai/
Models:
https://lnkd.in/e6aGs5sy
Code:
https://lnkd.in/eKzVF6vc


###
https://www.etched.com/

The world's first chip that burns transformer architecture directly into silicon, delivering a 10x improvement in speed and cost-efficiency over GPUs. The chip enables processing speeds exceeding 500,000 tokens per second.
The chip includes such features as speculative decoding, beam search, Monte Carlo Tree Search (MCTS), and Mixture of Experts (MoE).
- 144 GB of memory per chip, scalable to models with up to 100 trillion parameters,
- Fully open-source software stack
- Throughput surpasses NVIDIA 8xH100 and 8xB200 for LLaMA 70B

June 25, 2024
Etched is Making the Biggest Bet in AI

In 2022, we made a bet that transformers would take over the world.

We’ve spent the past two years building Sohu, the world’s first specialized chip (ASIC) for transformers (the “T” in ChatGPT).

By burning the transformer architecture into our chip, we can’t run most traditional AI models: the DLRMs powering Instagram ads, protein-folding models like AlphaFold 2, or older image models like Stable Diffusion 2. We can’t run CNNs, RNNs, or LSTMs either.

But for transformers, Sohu is the fastest chip of all time. It’s not even close.

With over 500,000 tokens per second in Llama 70B throughput, Sohu lets you build products impossible on GPUs. Sohu is an order of magnitude faster and cheaper than even NVIDIA’s next-generation Blackwell (B200) GPUs.

Today, every state-of-the-art AI model is a transformer: ChatGPT, Sora, Gemini, Stable Diffusion 3, and more. If transformers are replaced by SSMs, RWKV, or any new architecture, our chips will be useless.

But if we’re right, Sohu will change the world. Here’s why we took this bet.

Scale is all you need for superintelligence
In five years, AI models became smarter than humans on most standardized tests.

How? Because Meta used 50,000x more compute to train Llama 400B (2024 SoTA, smarter than most humans) than OpenAI used on GPT-2 (2019 SoTA).

By feeding AI models more compute and better data, they get smarter. Scale is the only trick that’s continued to work for decades, and every large AI company (Google, OpenAI / Microsoft, Anthropic / Amazon, etc.) is spending more than $100 billion over the next few years to keep scaling. We are living in the largest infrastructure buildout of all time.


OpenAI predicted GPT-4's performance with scaling laws and expects scaling laws to hold for GPT-5+.

"I think [we] can scale to the $100B range, … we’re going to get there in a few years"
- Dario Amodei, Anthropic CEO
"Scale is really good. When we have built a Dyson Sphere around the sun, we can entertain the discussion that we should stop scaling, but not before then"
- Sam Altman, OpenAI CEO
Scaling the next 1,000x will be very expensive. The next-generation data centers will cost more than the GDP of a small nation. At the current pace, our hardware, our power grids, and pocketbooks can’t keep up.

We're not worried about running out of data. Whether via synthetic data, annotation pipelines, or new AI-labeled data sources, we think the data problem is actually an inference compute problem. Mark Zuckerberg, Dario Amodei, and Demis Hassabis seem to agree.
GPUs are hitting a wall
Santa Clara’s dirty little secret is that GPUs haven’t gotten better, they’ve gotten bigger. The compute (TFLOPS) per area of the chip has been nearly flat for four years.


From 2022-2025, AI chips haven’t really gotten better, they’ve gotten bigger. NVIDIA’s B200, AMD’s MI300, Intel’s Gaudi 3, and Amazon’s Trainium2 count two chips as one card to “double” performance. All GPU performance gains from 2022-2025 use this trick, except Etched.

With Moore’s law slowing, the only way to improve performance is to specialize.‍

Specialized chips are inevitable
Before transformers took over the world, many companies built flexible AI chips and GPUs to handle the hundreds of various architectures. To name a few:

NVIDIA’s GPUs
Google’s TPUs
Amazon’s Trainium
AMD’s accelerators
Graphcore’s IPUs
SambaNova SN Series
Cerebras’s CS-2
Groq’s GroqNode
Tenstorrent’s Grayskull
D-Matrix’s Corsair
Cambricon’s Siyuan
Intel’s Gaudi
No one has ever built an algorithm-specific AI chip (ASIC). Chip projects cost $50-100M and take years to bring to production. When we started, there was no market.

Suddenly, that’s changed:

Unprecedented Demand: Before ChatGPT, the market for transformer inference was ~$50M, and now it’s billions. All big tech companies use transformer models (OpenAI, Google, Amazon, Microsoft, Facebook, etc.).
Convergence on Architecture: AI models used to change a lot. But since GPT-2, state-of-the-art model architectures have remained nearly identical! OpenAI’s GPT-family, Google’s PaLM, Facebook’s LLaMa, and even Tesla FSD are all transformers.
When models cost $1B+ to train and $10B+ for inference, specialized chips are inevitable. At this scale, a 1% improvement would justify a $50-100M custom chip project.

In reality, ASICs are orders of magnitude faster than GPUs. When bitcoin miners hit the market in 2014, it became cheaper to throw out GPUs than to use them to mine bitcoin.

With billions of dollars on the line, the same will happen for AI.



Transformers are shockingly similar: tweaks like SwiGLU activations and RoPE encodings are used everywhere: LLMs, embedding models, image inpainting, and video generation.

While GPT-2 to Llama-3 are state-of-the-art (SoTA) models five years apart, they have nearly identical architectures. The only major difference is scale.

Transformers have a huge moat
We believe in the hardware lottery: the models that win are the ones that can run the fastest and cheapest on hardware. Transformers are powerful, useful, and profitable enough to dominate every major AI compute market before alternatives are ready:

Transformers power every large AI product: from agents to search to chat. AI labs have spent hundreds of millions of dollars in R&D to optimize GPUs for transformers. The current and next-generation state-of-the-art models are transformers.
As models scale from $1B to $10B to $100B training runs in the next few years, the risk of testing new architectures skyrockets. Instead of re-testing scaling laws and performance, time is better spent building features on top of transformers, such as multi-token prediction.
Today’s software stack is optimized for transformers. Every popular library (TensorRT-LLM, vLLM, Huggingface TGI, etc.) has special kernels for running transformer models on GPUs. Many features built on top of transformers aren’t easily supported in alternatives (ex. speculative decoding, tree search).
Tomorrow’s hardware stack will be optimized for transformers. NVIDIA’s GB200s have special support for transformers (TransformerEngine). ASICs like Sohu entering the market mark the point of no return. Transformer killers will need to run on GPUs faster than transformers run on Sohu. If that happens, we’ll build an ASIC for that too!


Meet Sohu
Sohu is the world’s first transformer ASIC. One 8xSohu server replaces 160 H100 GPUs.


By specializing, Sohu gets unprecedented performance. One 8xSohu server can serve over 500,000 Llama 70B tokens per second.

Benchmarks are for Llama-3 70B in FP8 precision: no sparsity, 8x model parallel, 2048 input/128 output lengths.

8xH100s calculated with TensorRT-LLM 0.10.0 (latest version), 8xGB200 figures are estimated.

Sohu only supports transformer inference, whether for Llama or Stable Diffusion 3. Sohu supports all of today's models (Google, Meta, Microsoft, OpenAI, Anthropic, etc.) and can handle tweaks to future models.

Because Sohu can only run one algorithm, the vast majority of control flow logic can be removed, allowing it to have many more math blocks. As a result, Sohu boasts over 90% FLOPS utilization (compared to ~30% on a GPU with TRT-LLM).

How can we fit so much more FLOPS on our chip than GPUs?
The NVIDIA H200 has 989 TFLOPS of FP16/BF16 compute without sparsity. This is state-of-the-art (more than even Google's new Trillium chip), and the GB200 launching in 2025 has only 25% more compute (1,250 TFLOPS per die).

Since the vast majority of a GPU’s area is devoted to programmability, specializing on transformers lets you fit far more compute. You can prove this to yourself from first principles:

It takes 10,000 transistors to build a single FP16/BF16/FP8 multiply-add circuit, the building block for all matrix math. The H100 SXM has 528 tensor cores, and each has 4 × 8 × 16 FMA circuits. Multiplying tells us the H100 has 2.7 billion transistors dedicated to tensor cores.

But an H100 has 80 billion transistors! This means only 3.3% of the transistors on an H100 GPU are used for matrix multiplication!

This is a deliberate design decision by NVIDIA and other flexible AI chips. If you want to support all kinds of models (CNNs, LSTMs, SSMs, and others), you can’t do much better than this.

By only running transformers, we can fit way more FLOPS on our chip without resorting to lower precisions or sparsity.


Isn’t inference bottlenecked on memory bandwidth, not compute?
Actually, for modern models like Llama-3, no!

Let's use NVIDIA and AMD's standard benchmark: 2048 input tokens and 128 output tokens. Most AI products have much longer prompts than completions (even a new Claude chat has 1,000+ tokens in the system prompt).

On GPUs and on Sohu, inference is run in batches. Each batch loads all of the model weights once, and re-uses them across every token in the batch. Generally, LLM inputs are compute-bound, and LLM outputs are memory-bound. When we combine input and output tokens with continuous batching, the workload becomes very compute-bound.

Below is an example of continuous batching for an LLM. Here we are running sequences with four input tokens and four output tokens; each color is a different sequence.


We can scale up the same trick to run Llama-3-70B with 2048 input tokens and 128 output tokens. Have each batch consist of 2048 input tokens for one sequence, and 127 output tokens for 127 different sequences.

If we do this, each batch will require about (2048 + 127) tokens × 70B params × 2 FLOPs per param = 304 TFLOPs of compute, while only needing to load 70B params × 2 bytes per param = 140 GB of model weights and about 127 × 64 × 8 × 128 × (2048 + 127) × 2 × 2 = 72 GB of KV cache. That's far more compute than memory bandwidth: an H200 would need 6.8 PFLOPS of compute in order to max out its memory bandwidth. And that's at 100% utilization - if utilization was 30%, you'd need 3x more.

Since Sohu has so much compute with very high utilization, we can run enormous throughputs without bottlenecking on memory bandwidth.

In the real world, batches are much larger, input lengths vary, and requests arrive in a Poisson distribution. This technique works even better in these cases, but we use the 2048/128 benchmark for this example because NVIDIA and AMD use it.
How does software work?
On GPUs and TPUs, software is a nightmare. Handling arbitrary CUDA and PyTorch code requires an incredibly complicated compiler. Third-party AI chips (AMD, Intel, AWS, etc.) have together spent billions on software to little avail.

But since Sohu only runs transformers, we only need to write software for transformers!

Most companies running open-source or internal models use a transformer-specific inference library like TensorRT-LLM, vLLM, or HuggingFace’s TGI. These frameworks are very rigid - while you can tweak model hyperparameters, changing the underlying model code is not really supported. But this is fine - since all transformer models are so similar (even text/image/video ones), tweaking the hyperparameters is all you really need.

While this supports 95% of AI companies, the largest AI labs go custom. They have teams of engineers hand-tuning GPU kernels to squeeze out slightly more utilization, reverse engineering things like which registers have the lowest latency to each tensor core.

With Etched, you don’t need to reverse engineer anymore - our software, from drivers to kernels to the serving stack, will be open source. If you want to implement a custom transformer layer, your kernel wizards are free to do so.


Etched will be first
If this bet seems crazy now, imagine making it in 2022. When we started, ChatGPT didn’t exist! Image and video generation models were U-Nets, self-driving cars were powered by CNNs, and the transformer architecture was far from ubiquitous.

Luckily, the landscape has shifted in our favor. The top models in every domain—from language to vision—are now transformers. This convergence has not just validated the bet, but is also making Sohu the most important hardware project of the decade.

We’re on track for one of the fastest chip launches in history:

Top AI researchers and hardware engineers left every major AI chip project to join us
We’ve partnered directly with TSMC on their 4nm process. We’ve secured enough HBM and server supply from top vendors to quickly ramp our first year of production
Our early customers have reserved tens of millions of dollars of our hardware

If we’re right, Sohu will change the world
What would happen if AI models get 20x faster and cheaper overnight?

Today, Gemini takes over 60 seconds to answer a question about a video. Coding agents cost more than software engineers and take hours to finish tasks. Video models generate one frame per second, and even OpenAI ran out of GPU capacity when ChatGPT signups hit 10M users, just 0.15% of the world.

We aren’t on track to fix it - even if we do keep making GPUs bigger, at a rate of 2.5x every two years, it will take a decade to make video generation real-time.

But with Sohu, it’ll be instant. What happens when real-time video, calls, agents, and search finally just work?

Soon, you can find out. Please apply for early access to the Sohu HDK here.

And if you're excited about solving the compute crunch, we'd love to meet you. This is the most important problem of our time. Please apply for one of our open roles here.

###
https://github.com/apple/ml-hypercloning
Apple
HyperCloning

[Submitted on 19 Sep 2024 (v1), last revised 20 Sep 2024 (this version, v2)]
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
Mohammad Samragh, Iman Mirzadeh, Keivan Alizadeh Vahid, Fartash Faghri, Minsik Cho, Moin Nabi, Devang Naik, Mehrdad Farajtabar
The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small language models are less expensive to train, but they often cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms of training time and final accuracy? In this paper, we introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions. Our method ensures that the larger model retains the functionality of the smaller model. As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts. We demonstrate that training such an initialized model results in significant savings in terms of GPU hours required for pre-training large language models.

Why Train Larger Language Models from Scratch?
Is it better to train larger language models from the ground up, or can we leverage existing pretrained models to create larger variants?
Our latest research at Apple introduces HyperCloning, a revolutionary method that enables the extension of smaller, trained language models into larger ones.
HyperCloning facilitates the transfer of knowledge from the smaller model to the larger model, achieving training speeds that are 2-4 times faster than traditional methods! 💡


This software project accompanies the research paper, Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization.

HyperCloning can pass the knowledge from a small pre-trained LLM to a large LLM. The larger LLM can undergo fine-tuning to get improved accuracy.

Illustration of HyperCloning for Linear Layers

Installation
pip install -r requirements.txt

Sample Code
The following snippet shows how to clone a source model into a destination model:

from transformers import AutoModelForCausalLM
from hypercloning import cloneModel

# instantiate the source model (pretrained):
source_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Clone a model with 2x embedding size and 2x FFN dimension:
destination_model = cloneModel(source_model, embedding_dim_multiplier=2, up_project_multiplier=2)
You may modify and run the following to perform cloning on supported models:

python examples.py
Supported Models
The following families of models are currently supported:

OPT
Pythia
OLMo
Gemma
Llama


###
https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1
Microsoft
11/15/24

Oh wow! Microsoft Research released 1 MILLION synthetic instruction pairs covering different capabilities, such as text editing, creative writing, coding, reading comprehension, etc - permissively licensed 🔥
Explore it directly on the Hugging Face Hub!
Kudos MSFT! Let the fine-tuning games begin! ⚡
Dataset Card
This dataset is a fully synthetic set of instruction pairs where both the prompts and the responses have been synthetically generated, using the AgentInstruct framework.

AgentInstruct is an extensible agentic framework for synthetic data generation.

This dataset contains ~1 million instruction pairs generated by AgentInstruct, using only raw text content publicly available on the Web as seeds. The data covers different capabilities, such as text editing, creative writing, coding, reading comprehension, etc. The dataset can be used for instruction tuning of any base LLM.

Using the full AgentInstruct data (a superset of this dataset with ~25M instruction pairs), we post-trained Mistral-7b. When comparing the resulting model Orca-3-Mistral to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks. For example, 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH and 45% improvement on AlpacaEval.

Please refer to AgentInstruct: Toward Generative Teaching with Agentic Flows for further details.

Dataset Sources
Repository: microsoft/orca-agentinstruct-1M-v1

Paper: AgentInstruct: Toward Generative Teaching with Agentic Flows

Direct Use
This is a general dataset created by the AgentInstruct methodology. It contains diverse synthetic data. This could be used for research on instruction tuning and synthetic data use and generation.

Out-of-Scope Use
This dataset is being shared for research purposes. For training models to perform real-world tasks, we recommend further testing and validation where needed.

This dataset is not intended for use in educational systems or organizations, or for use in health systems.

###
https://github.com/microsoft/TinyTroupe
Microsoft
11/15/24

TinyTroupe 🤠🤓🥸🧐
LLM-powered multiagent persona simulation for imagination enhancement and business insights.

Microsoft released a neat Python library to run LLM-powered multi-agent simulations of people with personalities, interests, and goals.

A tiny office with tiny people doing some tiny jobs.

TinyTroupe is an experimental Python library that allows the simulation of people with specific personalities, interests, and goals. These artificial agents - TinyPersons - can listen to us and one another, reply back, and go about their lives in simulated TinyWorld environments. This is achieved by leveraging the power of Large Language Models (LLMs), notably GPT-4, to generate realistic simulated behavior. This allows us to investigate a wide range of convincing interactions and consumer types, with highly customizable personas, under conditions of our choosing. The focus is thus on understanding human behavior and not on directly supporting it (like, say, AI assistants do) -- this results in, among other things, specialized mechanisms that make sense only in a simulation setting. Further, unlike other game-like LLM-based simulation approaches, TinyTroupe aims at enlightening productivity and business scenarios, thereby contributing to more successful projects and products. Here are some application ideas to enhance human imagination:

Advertisement: TinyTroupe can evaluate digital ads (e.g., Bing Ads) offline with a simulated audience before spending money on them!
Software Testing: TinyTroupe can provide test input to systems (e.g., search engines, chatbots or copilots) and then evaluate the results.
Training and exploratory data: TinyTroupe can generate realistic synthetic data that can be later used to train models or be subject to opportunity analyses.
Product and project management: TinyTroupe can read project or product proposals and give feedback from the perspective of specific personas (e.g., physicians, lawyers, and knowledge workers in general).
Brainstorming: TinyTroupe can simulate focus groups and deliver great product feedback at a fraction of the cost!
In all of the above, and many others, we hope experimenters can gain insights about their domain of interest, and thus make better decisions.

We are releasing TinyTroupe at a relatively early stage, with considerable work still to be done, because we are looking for feedback and contributions to steer development in productive directions. We are particularly interested in finding new potential use cases, for instance in specific industries.

Note

🚧 WORK IN PROGRESS: expect frequent changes. TinyTroupe is an ongoing research project, still under very significant development and requiring further tidying up. In particular, the API is still subject to frequent changes. Experimenting with API variations is essential to shape it correctly, but we are working to stabilize it and provide a more consistent and friendly experience over time. We appreciate your patience and feedback as we continue to improve the library.

Caution

⚖️ Read the LEGAL DISCLAIMER. TinyTroupe is for research and simulation only. You are fully responsible for any use you make of the generated outputs. Various important additional legal considerations apply and constrain its use, please read the full Legal Disclaimer section below before using TinyTroupe.

Contents
📚 Examples
🛠️ Pre-requisites
📥 Installation
🌟 Principles
🏗️ Project Structure
📖 Using the Library
🤝 Contributing
🙏 Acknowledgements
📜 Citing TinyTroupe
⚖️ Legal Disclaimer
™️ Trademarks
Examples
To get a sense of what TinyTroupe can do, here are some examples of its use. These examples are available in the examples/ folder, and you can either inspect the pre-compiled Jupyter notebooks or run them yourself locally. Notice the interactive nature of TinyTroupe experiments -- just like you use Jupyter notebooks to interact with data, you can use TinyTroupe to interact with simulated people and environments, for the purpose of gaining insights.

Note

Currently, simulation outputs are better visualized against dark backgrounds, so we recommend using a dark theme in your Jupyter notebook client.

🧪Example 1 (from interview_with_customer.ipynb)
Let's begin with a simple customer interview scenario, where a business consultant approaches a banker:

An example.

The conversation can go on for a few steps to dig deeper and deeper until the consultant is satisfied with the information gathered, for instance a concrete project idea:

An example.

🧪EXAMPLE 2 (from advertisement_for_tv.ipynb)
Let's evaluate some online ads options to pick the best one. Here's one example output for TV ad evaluation:

An example.

Now, instead of having to carefully read what the agents said, we can extract the choice of each agent and compute the overall preference in an automated manner:

An example.

🧪 EXAMPLE 3 (from product_brainstorming.ipynb)
And here's a focus group starting to brainstorm about new AI features for Microsoft Word. Instead of interacting with each agent individually, we manipulate the environment to make them interact with each other:

An example.

After running a simulation, we can extract the results in a machine-readable manner, to reuse elsewhere (e.g., a report generator); here's what we get for the above brainstorming session:

An example.

You can find other examples in the examples/ folder.
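
To give a flavor of the API, here is a minimal sketch based on the example persona factories that ship with the repository; the library is experimental and the API changes frequently, so the exact names below may differ from the current version:

# Minimal TinyTroupe interaction sketch; names taken from the repository's
# examples and may have changed, since the API is still in flux.
from tinytroupe.examples import create_lisa_the_data_scientist

# Instantiate a pre-defined TinyPerson persona ("Lisa", a data scientist).
lisa = create_lisa_the_data_scientist()

# Send a stimulus to the agent and let it act on it; the simulated
# interaction is printed to the notebook/console.
lisa.listen_and_act("Tell me about your life.")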

Pre-requisites
To run the library, you need:

Python 3.10 or higher. We'll assume you are using Anaconda, but you can use other Python distributions.
Access to Azure OpenAI Service or OpenAI GPT-4 APIs. You can get access to the Azure OpenAI Service here, and to the OpenAI API here.
For Azure OpenAI Service, you will need to set the AZURE_OPENAI_KEY and AZURE_OPENAI_ENDPOINT environment variables to your API key and endpoint, respectively.
For OpenAI, you will need to set the OPENAI_API_KEY environment variable to your API key.
By default, TinyTroupe config.ini is set to use some specific API, model and related parameters. You can customize these values by including your own config.ini file in the same folder as the program or notebook you are running. An example of a config.ini file is provided in the examples/ folder.
Important

Content Filters: To ensure no harmful content is generated during simulations, it is strongly recommended to use content filters whenever available at the API level. In particular, if using Azure OpenAI, there's extensive support for content moderation, and we urge you to use it. For details about how to do so, please consult the corresponding Azure OpenAI documentation. If content filters are in place, and an API call is rejected by them, the library will raise an exception, as it will be unable to proceed with the simulation at that point.

###
https://github.com/opendatalab/MinerU/tree/master/projects/gradio_app
MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format.
Trending at the top of GitHub! Learn more about the project and how you can explore it with their Gradio app 👇
MinerU Key Features:
> Remove headers, footers, footnotes, page numbers, etc., to ensure semantic coherence.
> Output text in human-readable order, suitable for single-column, multi-column, and complex layouts.
> Preserve the structure of the original document, including headings, paragraphs, lists, etc.
> Extract images, image descriptions, tables, table titles, and footnotes.
> Automatically recognize and convert formulas in the document to LaTeX format.
> Automatically recognize and convert tables in the document to LaTeX or HTML format.
> Automatically detect scanned PDFs and garbled PDFs and enable OCR functionality.
> OCR supports detection and recognition of 84 languages.
> Supports multiple output formats, such as multimodal and NLP Markdown, JSON sorted by reading order, and rich intermediate formats.
> Supports various visualization results, including layout visualization and span visualization, for efficient confirmation of output quality.
> Supports both CPU and GPU environments.
> Compatible with Windows, Linux, and Mac platforms.
A Gradio-based application interface that you can run locally

###
https://huggingface.co/papers/2410.16540
A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration
Published on Oct 22
Authors: Yingqian Cui, Pengfei He, Xianfeng Tang, Qi He, Chen Luo, Jiliang Tang, Yue Xing
Abstract
Few-shot Chain-of-Thought (CoT) prompting has demonstrated strong performance in improving the reasoning capabilities of large language models (LLMs). While theoretical investigations have been conducted to understand CoT, the underlying transformer used in these studies isolates the CoT reasoning process into separated in-context learning steps (Stepwise ICL). In this work, we theoretically show that, compared to Stepwise ICL, the transformer gains better error correction ability and more accurate predictions if the reasoning from earlier steps (Coherent CoT) is integrated. Given that this coherent reasoning changes the behavior of the transformer, we further investigate the sensitivity of the transformer with Coherent CoT when the demonstration examples are corrupted at the inference stage. Our theoretical results indicate that the transformer is more sensitive to errors in intermediate reasoning steps than the final outcome. Building upon this observation, we propose an improvement on CoT by incorporating both correct and incorrect reasoning paths in the demonstration. Our experiments validate the effectiveness of the proposed approach.

Does Incorrect Reasoning + Explanation help improve Chain-of-Thought (CoT) prompting? A new paper measures how “extended” few-shot CoT prompting with both correct and incorrect reasoning paths boosts LLM reasoning.
Implementation
1️⃣ Select your base LLM (e.g. GPT-3.5, GPT-4, Gemini Pro, DeepSeek 67B)
2️⃣ Create samples with correct step-by-step reasoning
3️⃣ Generate incorrect reasoning paths
4️⃣ Add/generate explanations for what is wrong with the incorrect path
5️⃣ Combine both correct and incorrect paths with explanations in your prompts
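
As a concrete illustration of steps 2-5 (the demonstration texts and formatting below are our own placeholders, not taken from the paper), the prompt assembly could look like this:

# Placeholder demonstrations; in practice these are model-generated (see Insights below).
correct_demos = [{
    "question": "A train travels 60 km in 1.5 hours. What is its average speed?",
    "reasoning": "Speed = distance / time = 60 / 1.5 = 40 km/h.",
    "answer": "40 km/h",
}]

incorrect_demos = [{
    "question": "A train travels 60 km in 1.5 hours. What is its average speed?",
    "reasoning": "Speed = time / distance = 1.5 / 60 = 0.025 km/h.",
    "explanation": "Wrong: speed is distance divided by time, not time divided by distance.",
}]

def build_prompt(question: str) -> str:
    """Assemble an extended few-shot CoT prompt with correct and incorrect paths."""
    parts = []
    for d in correct_demos:
        parts.append(f"Q: {d['question']}\nCorrect reasoning: {d['reasoning']}\nA: {d['answer']}\n")
    for d in incorrect_demos:
        parts.append(f"Q: {d['question']}\nIncorrect reasoning: {d['reasoning']}\n"
                     f"Why it is wrong: {d['explanation']}\n")
    parts.append(f"Q: {question}\nReason step by step, then give the final answer.")
    return "\n".join(parts)

print(build_prompt("A car covers 150 km in 2.5 hours. What is its average speed?"))
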
Insights
💡 Model-generated incorrect examples work better than handcrafted ones
🚫 Including wrong reasoning without explanations hurts performance
🎯 Few-Shot examples selection impacts performance
❌ LLMs are more sensitive to errors in intermediate reasoning steps than in the final answer.
🚀 Incorrect Reasoning + Explanations improves performance across all LLMs


###
https://www.adaptive-ml.com/post/from-zero-to-ppo
Adaptive ML
From Zero to PPO: Understanding the Path to Helpful AI Models
Research
November 19, 2024

A non-Instruct LLM assistant is mostly useless. 🧐
Since it is mostly a model trained to complete text, when you ask it a question like "What to do during a stopover in Paris?", it can just go on and on adding more details to your question instead of answering it. That would be a valid way to continue text from its training corpus, but not a useful way to answer a question.
➡️ So the post-training stage includes an important instruction-tuning step where you teach your model how to be useful: answer questions, be concise, be polite... RLHF is a well-known technique for this.
For anyone interested in understanding how this step works, the folks at Adaptive ML have made a great guide!


Model: Llama 3.1 8B (pre-trained)
Question: What to do during a stopover in Paris?
Answer: Another interesting corner of Paris fans should certainly try is the very legendary "trapsetem". Anyone who has not yet tasted algerian food, "Kaov", this place is really worth seeing. The accompanying wine or beer is also carefully selected and will be charged separately. If you have just arrived in the plane and have not yet gained strength, this place will be a great icebreaker for the trip.

That's... not quite right! Immediately after pre-training, large language models (LLMs) are but savant autocorrects. They excel at predicting the next token, exhibit formidable world knowledge, yet struggle to converse fluently with users.

Helpfulness is instilled in LLMs as a result of extensive post-training. Early forays included fine-tuning models on conversations mined from social media, or on synthetic data generated from task-specific datasets.

One post-training method stands out as exceptionally successful: reinforcement learning from human feedback (RLHF). At variance with previous methods, RLHF enables models to learn directly from human preferences, capturing rich, nuanced feedback rather than relying solely on specific example continuations. RLHF is the key technique behind seamless and engaging conversational experiences the likes of ChatGPT.

One of the de facto engines of RLHF is proximal policy optimization (PPO). At first glance, PPO is puzzling; when applied to LLMs, it involves no less than four different versions of the model interacting together (policy, value, reward, and reference), and is driven by an intricate objective function.

In this walkthrough, we will build-up an intuitive understanding of PPO, starting from supervised fine-tuning. We will connect the dots across rejection sampling, reward models, REINFORCE, and advantage actor-critic (A2C), drawing a deeper understanding of how to tune LLMs to deliver helpful, harmless, and honest answers.

Supervised finetuning: maximizing the likelihood of helpful answers
At the start of this post, we asked Llama 3.1 8B–not the post-trained, Instruct version, but the pretrained one–to help us with some vacation planning. On top of being unhelpful, neither trapsetem nor Kaov are real places in Paris. Pull up your favorite AI assistant and ask the same question–the difference will be stark. Why is that?

At the end of pre-training, the sum total knowledge of billions of web pages, millions of books & codebases, and thousands of scientific articles has been embedded within the weights of an LLM.

Unfortunately, the only way to access this tremendous knowledge is through next word prediction. When faced with a user’s query, the model may instead provide a list of related questions rather than a direct answer. Or provide a very specific, not entirely applicable answer. Asking for further clarification will only add to the confusion. Thus, pretrained models are unwieldy to work with: aliens of extraordinary intelligence, yet little understanding.


Partly, this is a context & diversity problem. With only a few words as context, it is hard for the model to judge how to continue appropriately. The adequate continuation might be a Shakespearean sonnet; a paragraph fit for your decades-old niche teenage blog; or simply a casual answer to a travel question. All of this, and more, are present in the pre-training corpus.

Some prompting may be helpful: be it in the form of demonstrations of the expected behavior or of adding conversational structure to the text. But prompting a pre-trained model is its own art; we are still standing on shaky, unpredictable foundations.

Let’s fix that. We will tune the desired behavior into the weights of the model. Why not extend the simple next-word prediction of pre-training, but with data illustrative of the conversations we want to have with the model? This is the basis of supervised fine-tuning (SFT).



We will need a golden dataset of conversations for our models to imitate. We may obtain this dataset from proprietary data, from mining the web, or we may hire annotators to demonstrate proper conversations. Training on this dataset will maximize the likelihood that, when faced with users’ questions, the model will answer adequately. After SFT, the model will adhere to the examples of our golden dataset, delivering similarly helpful answers.
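
In symbols (our notation, not from the post), SFT simply maximizes the log-likelihood of the golden responses given the prompts:

L_SFT(\theta) = - E_{(x, y) \sim D_{gold}} \sum_t \log \pi_\theta(y_t \mid x, y_{<t})

where x is the prompt, y the golden answer, and \pi_\theta the trainee LLM.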

Wasn’t that easy? Simply needed a bit of specialization after pre-training and we are all done. Ship to production! This is the only tuning technique we need… right?

Not quite. Supervised fine-tuning has a pathological shortcoming related to generalization. SFT provides specific demonstrations of the right answer to the model ex-nihilo: the model did not come up with them on its own. SFT gives the LLM a fish; it does not teach it to fish.

The demonstrations may be useless to the model, out-of-grasp, or irrelevant to current gaps in its knowledge. Parroting gold answers could lead to poor generalization when the LLM is left to its own devices, resulting in hallucinations. Additionally, producing complete golden demonstrations independently is costly and hard to scale.

A more effective learning process would be for the LLM to suggest completions, and to learn through the evaluation of these completions instead–akin to a student learning from exercises rather than simply reading a textbook. Let’s build that better learning process!

Learning from rejection sampling: finding the model’s own helpful answers
During pre-training and supervised fine-tuning, there is no iterative inference. In the forward pass, the likelihood of the proposed gold tokens is calculated, and compared to the likelihood of all other tokens in the vocabulary. The training then maximizes the likelihood of the proposed tokens. Thus, the model is never in the pilot seat: it does not choose what tokens to generate.

Instead, we now seek to learn from answers generated by the model itself. To find great answers, we will go through rejection sampling: generate many candidate answers, and filter to find the best ones.

Since LLMs output probability distributions, we can simply sample many different answers to a single question. This can be done by increasing the temperature of sampling. We now have a riff-raff of candidate answers, and it’s time for evaluation.

The answers need ratings, to let us select the best ones only. Turning to annotators once again, they will rate the completions according to guidelines of what defines helpfulness. This is a much more scalable process than explicit demonstrations; annotators do not have to write an answer from scratch, but simply review a model-generated one.

At this stage, we can fine-tune on the top-rated completions. Is this effective? Yes! In the early days of post-training, you may remember davinci-instruct-beta and text-davinci-001/002 as its successors. Well, the beta model was trained using SFT on gold demonstrations. Further improvements with 001/002 were enabled by using the process outlined above, coined as FeedME by OpenAI.
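
A rough sketch of this best-of-n collection step follows; the generate and rate callables and the n=8 default are placeholders for your own sampling and rating pipeline, not something defined in the post:

def collect_best_of_n(prompts, generate, rate, n=8):
    """Sample n candidate answers per prompt and keep only the top-rated one."""
    selected = []
    for prompt in prompts:
        # Sampling with a non-zero temperature yields diverse candidates.
        candidates = [generate(prompt, temperature=1.0) for _ in range(n)]
        best = max(candidates, key=rate)  # rate() returns a scalar score per completion
        selected.append({"prompt": prompt, "completion": best})
    return selected  # fine-tune (SFT) on these pairs afterwards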



Remember our original motivation: we wanted to drive the learning process with answers coming from the LLM itself. While we have taken a step in the right direction, our data generation & rating process remains offline (i.e., it happens once before SFT). Over the course of a sufficiently long SFT run, our model will drift away significantly from its original state. Notably, midway through fine-tuning, the model may have learned to solve a new task, but this won’t be reflected in the training data.

Ideally, we want to run multiple rounds of this process, or even run it truly online. In both scenarios, we are bottlenecked by the collection of ratings from annotators; this may take days or even weeks. Time to automate that process.

Reward modeling: let models evaluate the helpfulness of answers
Rather than always requiring an annotator to produce ratings, we will hand off this task to a reward model (RM).

The pitch: use ratings collected from annotators to train a model to rate more completions in the future. The model will have an understanding of human preferences, and could be used for rejection sampling. Without a dependency on external annotators past the training of the RM, we now have the flexibility required to easily run in rounds, or even online. This approach was extensively explored in the Llama 2 paper.

Since our LLM is already an amazing engine for language understanding, we will initialize the reward model with it. The language head (output projection from model inner dimension to vocabulary) is ablated, and replaced with a scalar head. The RM produces a single reward value.

One challenge in collecting ratings from annotators is calibration: i.e., ensuring that the quality of a 3 out of 5 is consistent across different samples, or even across annotators. Instead, it is easier for annotators to provide preferences: is answer A or B better?

Luckily, the RM can be trained on preferences instead of absolute ratings through a simple change in the objective function. Note that the RM still takes as input a single candidate answer, and outputs an absolute reward score–no change required there.
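
The post does not spell the objective out, but the usual formulation is a Bradley-Terry style pairwise loss (our notation):

L_RM(\phi) = - E_{(x, y_w, y_l)} \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big)

where y_w is the preferred answer, y_l the rejected one, and \sigma the sigmoid; the RM still scores each answer independently.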



We have significantly augmented our ability to evaluate LLM samples. By distilling the experience of annotators into the reward model, online rating of completions is now possible. Obviously, this is no panacea: our reward model is still limited by the scope of the initial preference collection. Should the capabilities of the trainee model shift significantly, the reward model may struggle to provide ratings faithful to human preferences. Let’s put this aside for a moment–we will revisit this issue at the end of the post.

It’s time to think about efficiency. Remember, thus far, we have only trained on the top-rated completions. Of the millions we generate, we may only use a few thousand. What if we could also learn from the poorly-rated ones? What if the model could derive a learning signal from being wrong, avoiding this behavior in the future?

REINFORCE: learn from the good, and the bad
Encourage top-rated completions, discourage low-rated ones. No extra trick needed: this is all it takes to implement REINFORCE.

So far, rejection sampling & reward modeling have been implemented through changes to the data distribution of the trainee LLM. For REINFORCE, the change occurs at the loss-level (i.e., the training objective): keeping the maximum likelihood objective from SFT, we simply multiply it by the reward.

Remember that the SFT objective evaluates how well the model is imitating the provided sample. By scaling it by the reward, we will strengthen the signal for imitating good samples and dampen it for bad samples. Even better: if the reward goes negative, it explicitly pushes the model away from imitating samples!

Thus, we will need to ensure the reward is adequately scaled. Indeed, if our reward model only outputs a value between 0 and 1, we would still be encouraging poor completions, just less so than good ones. Bad completions need to have a negative score.

Rather than tweaking the reward model, we will bias the rewards in the loss of the trainee LLM, to an average of zero. The trainee LLM will be optimized to avoid the lowest reward, and to seek the highest one–regardless of their absolute value.
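
Putting that in symbols (our notation), the REINFORCE objective is the SFT likelihood scaled by a zero-centered reward:

L_REINFORCE(\theta) = - E_{y \sim \pi_\theta} \Big[ \big( R(x, y) - b \big) \sum_t \log \pi_\theta(y_t \mid x, y_{<t}) \Big]

where R(x, y) is the reward-model score and b is the baseline, e.g. the mean reward over the batch.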



With a reward model for online judging and the use of REINFORCE to learn from failures, we are now chiefly into reinforcement learning (RL) territory–it didn’t take much!

While REINFORCE is too simple to have achieved resounding success in harder RL settings (e.g., Atari games), it is sufficient to effectively tune LLMs to be helpful. Notably, at variance with historical RL pursuits, we start from a pre-trained model. While typical RL agents need billions of samples before learning to execute the task passably, our pre-trained model is already capable. Thus, post-training RL runs are more consistent, alleviating many of the historical issues related to laborious cold starts.

It’s time to up the ante. The reward is currently attributed at the completion-level: i.e. there is a single reward for the entire answer. Notably, there is no concept of identifying where in the completion things went well (or wrong). Back to our example:

‍Question: What to do during a stopover in Paris?
Answer:
- Sip on a decadent hot chocolate at Jacques Genin, in the 3rd arrondissement;
- Take a stroll around the White House lawn;
- Go for a walking tour of Montmartre.

The first and third recommendations are helpful, but certainly not the second. Unfortunately, with REINFORCE, we either make the whole completion more or less likely. This is unfair! The LLM actually had ⅔ of a reasonable answer. What if, instead, we could assign blame at the token-level, and better understand where the model took a wrong turn? Let’s tackle the problem of credit assignment.

Advantage actor-critic (A2C): understand what makes an answer helpful
As we foray deeper into reinforcement learning, we need to adopt some of its terminology. RL deals with states and actions. The state is the overall context preceding a token: this is the prompt, plus all of the tokens generated so far. In a given state, the LLM takes an action: it generates a new token. Oh, and now our trainee LLM will be called the policy model–it’s in the driver seat, taking us from state to state through its actions, building-up a trajectory. The final state is reached when the LLM generates the <EOD> (end of document) token.

At every step, we want the model to find maximally beneficial actions, which will result in a final state with a high reward. Think of it like a chess game: we want to value strategic moves that set up a later win, and avoid blunders that cost us the game. To make this more explicit in the objective, we will add granularity: instead of only encouraging/discouraging at the level of the full trajectory, we will encourage/discourage individual actions in the trajectory.

To do so, couldn’t we simply use the reward on unfinished completions? Not quite. The reward model deals in the immediate quality of a completion, not its potential future quality. It has no concept of how valuable a current state is for the future. Indeed, incomplete answers would be rated poorly by the reward model.

Instead, we want to evaluate how valuable a given state is. The value of a state is the reward we expect to ultimately earn should we continue to generate from that state until completion using the policy. Evaluating this will be the responsibility of the value model. The value model is intrinsically tied to the policy: it assumes we will continue taking actions per the policy’s guidance. Accordingly, we will need to train it alongside the policy.



The value model is no different in its mechanics (i.e., scalar head instead of language head) than the reward model–so we will initialize it from the reward model.

To train the value model, we will use a simple regression objective. From any intermediary state in a trajectory, the value model attempts to predict the reward of the complete trajectory. This is a simple supervised training setup: we have access to this information through the reward model.
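
Concretely (our notation), the value model V_\psi is fit with a regression loss against the reward of the finished trajectory:

L_value(\psi) = E_t \big[ \big( V_\psi(s_t) - R(x, y) \big)^2 \big]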

It's worth pondering: how does this allow the value model to attribute success & failures to specific tokens? Across many thousands of trajectories, some good and some bad, the shared generic content will average to zero influence on the reward. Meanwhile, the model will learn to generalize and draw connections between actions that systematically, across completions, result in a positive or negative outcome.

Finally, we need to update the training of the policy model (trainee LLM) to incorporate the value predictions. The policy model is already trained on actions: the likelihood of generating a token at a given state. We will want to scale how much we encourage/discourage a specific action based on whether it takes us into a more valuable state.

To do so, we need to calculate the action's advantage. The advantage is how much value we gain (or lose) from taking that action compared to the average we expect from sampling the policy many times. (Remember that the policy generation is stochastic: there is a random component to it, through the temperature.) Once we have the value and reward models, we can compute advantages through Generalized Advantage Estimation (GAE).
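
For reference (our notation, following the standard GAE formulation rather than the post), the per-token advantage combines reward and value estimates:

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad A_t = \sum_{l \ge 0} (\gamma \lambda)^l \delta_{t+l}

with discount \gamma and smoothing parameter \lambda; in RLHF setups \gamma is commonly set to 1 and the per-token reward r_t is zero except at the final token (plus any per-token penalties introduced later).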

Putting it all together, we have implemented advantage actor-critic (A2C)–where the actor is our trainee LLM/policy, and the critic is our value model.



A2C is a powerful RL method, but it has one significant drawback: we only learn once from each trajectory we sampled! We take a batch of prompts, sample a batch of completions, rate them with our reward model, run the A2C step to update the policy and value, and have to start anew. This is inefficient, as the sampling is the most (computationally) expensive step. Even more so when tool use (e.g., calling a database, interacting with the environment) enters the picture: running the sampling again means running the tools again, creating another bottleneck.

Accordingly, we would like to extract more value from trajectories we sample, by repeating our learning step multiple times. Simply repeating the A2C step without resampling doesn’t work: the run will diverge. Let’s do it in a principled way.

Proximal policy optimization (PPO): learn more from model answers
At last, proximal policy optimization (PPO). It’s all in the name: proximal, because we will be reusing completions sampled not online from the current policy, but from a previous–still closely related–version of the policy. We will call it the sampling policy.

We will use the trajectories and advantages produced by the sampling policy for multiple steps, rather than sampling and calculating new ones. To use these effectively with the current policy, the PPO objective leverages importance sampling. Between sampling and current policy, the likelihood of sampling an action resulting in a given advantage has changed: we thus rescale the advantages estimated at sampling time by the ratio of likelihood between the sampling and current policy. This allows us to use these advantages for many steps.

Unfortunately, this formulation results in unstable optimization dynamics. Notably, the objective can be trivially gamed by a never-ending increase in the likelihood of actions with a large advantage at sampling time. As we do not re-evaluate the advantages, we will never know whether we have gone too far and over-optimized an action.

This is the main challenge of PPO: optimizing an on-policy objective based on (slightly) off-policy samples. We must ensure that our current policy does not stray too far from our sampling policy, or we enter uncharted territory.

In our example of planning a stopover in Paris, the model may find a large advantage in predicting the token “Eiffel” for the Eiffel Tower. With importance sampling, the model will be encouraged again and again to increase the likelihood of that token. There is no mechanism to moderate this behavior. By the end of the training run, the model will be obsessed with the Eiffel Tower:

Question: What to do during a stopover in Paris?
Answer:
1. Visit the Eiffel Tower;
2. Visit the Eiffel Tower;
3. Visit the Eiffel Tower.

Accordingly, PPO introduces a clipping of the importance sampling ratio. This clipping acts in two ways. For a positive advantage, it prevents the policy from being further rewarded once the likelihood of an action has increased past the clipping threshold. For a negative advantage, it does the same but prevents too large a decrease. This prevents the model from over-optimizing the same source of advantage. Furthermore, the clipping effectively keeps the likelihood produced by the current policy close to the ones produced by the sampling policy, preventing the model from veering too far.
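
In symbols (our notation), with \rho_t the importance sampling ratio between the current and sampling policies, the clipped PPO objective is:

\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{sampling}(a_t \mid s_t)
L_PPO(\theta) = - E_t \big[ \min\big( \rho_t(\theta) A_t,\ \mathrm{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) A_t \big) \big]

where \epsilon is the clipping threshold (a small value such as 0.2 is typical).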

Let's summarize what our PPO implementation looks like:

Outer generation loop: repeat until the prompt dataset is exhausted
- Take a batch of prompts from our dataset and sample completions with the policy model (the sampling policy), setting aside the sampling likelihoods;
- Evaluate completions with the reward model to obtain the rewards;
- Evaluate completions with the value model to obtain the values;
- Calculate the advantages using the rewards and values;
- Inner optimization loop: repeat for k steps
  - Evaluate current likelihoods of completions using the current policy model;
  - Compute the importance sampling ratio between sampling and current likelihoods; this ratio is 1 for the first step (sampling policy is current policy);
  - Update the policy model using the PPO objective.
- Fit the value model using the rewards and values.


Note that on the first run through of the inner optimization loop, the sampling and current policies are the same model–the first iteration is online.

Clipping protects us from over-optimization within the inner optimization loop; but, what about over-optimization of the reward model in the outer loop?

PPO+KL penalty: one last trick for principled training
Unfortunately, the Eiffel Tower example can arise from the reward model too–across SFT from rejection sampling, REINFORCE, A2C, and PPO. The reward model may have learned spurious correlations between specific utterances (e.g., “delve into”, “certainly”) and human preferences. In turn, the policy may over-optimize for these dubious connections. Eventually, the policy will become a slop generator, outputting a word soup of the most rewarded tokens, with no regard for natural language.

Once again, this issue is related to online vs offline optimization. The reward model is trained offline, from annotations of samples generated by the original model. Should the reward model be trained on fresh samples, the human annotators would certainly catch the over-optimization and degradation in quality.

A simple solution is to split post-training into multiple rounds of PPO. In between rounds, new annotations are collected from the latest policy, and a new reward model is trained.

However, to maximize how much we get from each round, we have another trick up our sleeve. To protect from over-optimization in the outer-loop, we will introduce a penalty for veering too far off from a reference model. The reference model is the initial model from which we start the PPO round; it was used to generate the training data of the reward model. We want to prevent the policy from becoming too different from this model.

To the reward, we add a token-level penalty proportional to how different the likelihood distributions for this token are between the policy and the reference model. To do so, we use the Kullback-Leibler divergence (KL). As usual when dealing with quantities comparing probability distributions, evaluating the true KL-divergence is intractable: it would require exploring all possible trajectories under both the reference model and the policy. Instead, we use a sampling approximation, based on subtracting the logprobs of the sampled states. This estimate is bias-free, although it has high variance.
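
In symbols (our notation), the per-token reward fed to PPO becomes:

r_t = \mathbb{1}[t = T]\, r_{RM}(x, y) - \beta \big( \log \pi_\theta(a_t \mid s_t) - \log \pi_{ref}(a_t \mid s_t) \big)

where the reward-model score is attributed to the final token T, \beta controls the strength of the penalty, and the log-prob difference is the sampling-based KL estimate described above.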

OK – let's put it all together:

Outer generation loop: repeat until the prompt dataset is exhausted
- Take a batch of prompts from our dataset and sample completions with the policy model (the sampling policy), setting aside the sampling likelihoods;
- Evaluate the completions with the reference model to obtain the reference likelihoods;
- Calculate the token-wise KL penalty using the reference and sampling likelihoods;
- Evaluate completions with the reward model and subtract the KL penalty to obtain the rewards;
- Evaluate completions with the value model to obtain the values;
- Calculate the advantages using the rewards and values;
- Inner optimization loop: repeat for k steps
  - Calculate current likelihoods of completions using the current policy model;
  - Compute the importance sampling ratio between sampling and current likelihoods; this ratio is 1 for the first step (sampling policy is current policy);
  - Update the policy model using the PPO objective;
- Fit the value model using the rewards and values.
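
As a compact, schematic sketch of the loop above (every helper passed in is a placeholder standing in for a component of a real RLHF stack; this is not Adaptive ML's implementation, and the hyperparameters are illustrative):

def ppo_with_kl_round(prompt_loader, policy, reference, reward_model, value_model,
                      sample, logprobs, gae, ppo_update, value_update,
                      beta=0.05, k_inner_steps=4):
    """Schematic PPO + KL-penalty round; all callables are placeholders."""
    for prompts in prompt_loader:                                # outer generation loop
        completions, sampling_logprobs = sample(policy, prompts)
        reference_logprobs = logprobs(reference, prompts, completions)
        kl_penalty = sampling_logprobs - reference_logprobs      # sampling-based KL estimate
        rewards = reward_model(prompts, completions) - beta * kl_penalty
        values = value_model(prompts, completions)
        advantages = gae(rewards, values)
        for _ in range(k_inner_steps):                           # inner optimization loop
            current_logprobs = logprobs(policy, prompts, completions)
            ratio = (current_logprobs - sampling_logprobs).exp() # importance sampling ratio
            ppo_update(policy, ratio, advantages)                # clipped PPO objective
            value_update(value_model, values, rewards)           # regression toward trajectory rewards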


Post-training from human preferences… and beyond
Starting from supervised fine-tuning (SFT), we have built up to proximal policy optimization (PPO). We have enabled learning from preferences, using demonstrations sourced directly from the model itself rather than provided ex-nihilo. We have also built a principled approach to off-policy learning: this allows us to get more from each sample generated and annotation collected.

Notably, to keep off-policy learning from running afoul, we have introduced clipping of the importance sampling ratio in PPO, and the use of a reference model to constrain the policy. Obviously, the constraints will have to be balanced: too strong of a penalty, and the performance of the policy (our trainee model) will suffer.

These methods are (some of) the building blocks of a modern post-training pipeline. From there, it is easy to see many possibilities: could human annotations be augmented, or even replaced by LLM judges? Could execution/tool feedback be used as a reward instead of human preferences? Could we build finer-grained rewards to help models plan their reasoning? Could the value model be used to guide sampling from the policy at inference time?

A mature pipeline will combine multiple stages of SFT, PPO, and more. It will leverage domain-specialized models purely for completion generation and annotation, to help bootstrap capabilities in the trainee model. It will fork, select, and merge the most promising models. Ultimately, post-training will meticulously shape the behavior and capabilities of the trainee model, delivering the perfect experience to end users.



Authors: Alessandro Cappelli, Laetitia Teodorescu, Julien Launay
Edited by: Pearson Probst

###
https://github.com/av/harbor/wiki/2.3.27-Satellite:-K6#standalone-setup
11/19/24
Harbor

TIL: You can now use k6 to load test your LLM Applications! 😍 Harbor added an OpenAI-compatible client to easily benchmark and load-test AI solutions with Grafana Labs k6! 🔥
TL;DR:
✅ load-testing using k6 for OpenAI-compatible APIs
📊 Comes with pre-configured Influx and Grafana dashboards
🔧 Built-in helpers for OpenAI-compatible clients and custom datasets
📈 Example scripts for local and hosted endpoints to get started quickly
🤗 Supports Hugging Face TGI, vLLM or Ollama
OpenAI-helper:
https://lnkd.in/e2bstTn8
Docs:
https://lnkd.in/eFWNKm_k
Example:
https://lnkd.in/e2dx9zRj

2.3.27 Satellite: K6
K6
Handle: k6
URL: http://localhost:34233

k6 is a modern load-testing tool. It's built to be powerful, extensible, and full-featured. The key design goal is to provide the best developer experience.

Harbor's k6 comes with a few additional features that make it useful for LLM-related tasks out of the box:

Pre-configured Influx and Grafana instances for metric collection and display
Built-in Grafana dashboard for OpenAI-compatible API stats
Built-in k6 helpers for OpenAI-compatible API clients
Harbor's custom OpenAI-API dashboard for K6

Starting
# [Optional] Pre-pull the images
harbor pull k6

# Run K6 CLI (connected to "k6 run" by default)
harbor k6 --help
Harbor mounts a local workspace located in $(harbor home)/k6 to the container. You can use it to store your scripts, configuration, and other data.

$(harbor home)/k6/scripts - K6 scripts
$(harbor home)/k6/dashboards - Grafana dashboards
Usage
In most instances, your workflow will be centered around editing and running performance testing scripts. You can find detailed guides on k6 functionality in the official documentation.

Here are a few helper commands to get you started:

# Open K6 scripts in VS Code
code $(harbor home)/k6/scripts

# Run a given script
# Note the forward slash - it's within the k6 container
# This will print Grafana URL to monitor the test
harbor k6 run /scripts/my-script.js

# Open Grafana dashboard
harbor open k6
The k6 CLI will start Grafana/InfluxDB if needed - they'll continue to be running in the background until stopped explicitly (meaning you can collect results for more than one test).

The DB storage is not permanent and the contents will be erased after stopping the service.

Harbor comes with a few example scripts to get you started:

/scripts/example.js - the "hello world" of K6
/scripts/ollama.js - basic example of profiling Harbor's built-in Ollama instance
/scripts/openai-api-throughput.js - showcases all the features added by Harbor, used in the screenshot above
OpenAI-compatible API scripting
Harbor's k6 has a built-in helper for OpenAI-compatible API clients. You can use it to easily create scripts that interact with the API.

import * as oai from './helpers/openaiGeneric.js';

const client = oai.createClient({
  // URL of the API, note that
  // "/v1" is added by the helper
  url: 'http://ollama:11434',
  options: {
    // A set of options that'll be added to every
    // request sent by this client. In other words -
    // a subset of the body of the request for the given endpoint
    model: 'qwen2.5-coder:1.5b-base-q8_0',
  },
});

// /v1/completions endpoint
const completion = client.complete({
  prompt: 'The meaning of life is',
  max_tokens: 10,
  // You can specify anything else supported by the
  // downstream service endpoint here, these
  // will override the "options" from the client as well.
});

// /v1/chat/completions endpoint
const chatCompletion = client.chatComplete({
  messages: [
    { role: "user", content: "Answer in one word. Where is the moon?" },
  ],
  // You can specify anything else supported by the
  // downstream service endpoint here, these will
  // override the "options" from the client as well.
});
See the helper source for more details.

Standalone Setup
It's recommended to go with the Harbor setup, as it'll only use an additional 6MB of disk space and will save you a lot of time. However, it's possible to use the setup standalone.

Requirements:

Docker with standalone compose plugin
Ability to run docker commands without sudo
Free disk space for K6, Grafana, InfluxDB docker images
Step 1: Check out the folder with k6 scripts and grafana resources

Here's an example one-liner that'll put https://github.com/av/harbor/k6 into a k6-oai-performance folder at the current location:

git clone --depth 1 https://github.com/av/harbor /tmp/harbor-tmp && cp -r /tmp/harbor-tmp/k6 ./k6-oai-performance && rm -rf /tmp/harbor-tmp
Step 2: Run k6 compose service with docker compose:

# Start InfluxDB, Grafana
docker compose up -d

# Run example script
docker compose run k6 run /scripts/example.js

# After you're done with testing
docker compose down
The configuration and usage are otherwise identical to the main documentation above, except that you'll also need to manually spin up any other services you'll be testing.

###
https://ai.meta.com/research/publications/llama-guard-3-vision-safeguarding-human-ai-image-understanding-conversations/
META
CONVERSATIONAL AI
COMPUTER VISION
Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
November 20, 2024

Abstract
We introduce Llama Guard 3 Vision, a multimodal LLM-based safeguard for human-AI conversations that involves image understanding: it can be used to safeguard content for both multimodal LLM inputs (prompt classification) and outputs (response classification). Unlike the previous text-only Llama Guard versions (Inan et al., 2023; Llama Team, 2024b,a), it is specifically designed to support image reasoning use cases and is optimized to detect harmful multimodal (text and image) prompts and text responses to these prompts. Llama Guard 3 Vision is fine-tuned on Llama 3.2-Vision and demonstrates strong performance on the internal benchmarks using the MLCommons taxonomy. We also test its robustness against adversarial attacks. We believe that Llama Guard 3 Vision serves as a good starting point to build more capable and robust content moderation tools for human-AI conversation with multimodal capabilities.


###
https://ai.meta.com/research/publications/llama-guard-3-1b-int4-compact-and-efficient-safeguard-for-human-ai-conversations/
META
NLP
CORE MACHINE LEARNING
Llama Guard 3-1B-INT4: Compact and Efficient Safeguard for Human-AI Conversations
November 20, 2024

Abstract
This paper presents Llama Guard 3-1B-INT4, a compact and efficient Llama Guard model, which has been open-sourced to the community during Meta Connect 2024. We demonstrate that Llama Guard 3-1B-INT4 can be deployed on resource-constrained devices, achieving a throughput of at least 30 tokens per second and a time-to-first-token of 2.5 seconds or less on a commodity Android mobile CPU. Notably, our experiments show that Llama Guard 3-1B-INT4 attains comparable or superior safety moderation scores to its larger counterpart, Llama Guard 3-1B, despite being approximately 7 times smaller in size (440MB).

Write in as much technical detail as possible. There are 16 articles; include every single one without omission.