hiyouga/LLaMA-Factory: Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Easily fine-tune 100+ large language models with zero-code CLI and Web UI
Public notes from activescott tagged with #llm
we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks.
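For orientation, each SWE-bench instance pairs a repository snapshot and an issue description with the gold patch and the tests that decide success. The sketch below shows roughly what one record looks like; the field names follow the publicly released dataset as I recall them, and the specific values are invented, so treat it as illustrative rather than authoritative.

```python
# Illustrative shape of a single SWE-bench instance (values are made up;
# field names are my best recollection of the released dataset schema).
example_instance = {
    "repo": "astropy/astropy",                 # one of the 12 Python repositories
    "instance_id": "astropy__astropy-00000",   # hypothetical ID for illustration
    "base_commit": "abc123",                   # codebase snapshot the model edits
    "problem_statement": "Issue text describing the bug or feature request...",
    "patch": "diff --git a/astropy/...",       # gold patch from the linked pull request
    "test_patch": "diff --git a/astropy/tests/...",  # tests added/changed by that PR
    "FAIL_TO_PASS": ["test_that_should_pass_after_the_fix"],
    "PASS_TO_PASS": ["test_that_must_keep_passing"],
}

# The model sees `problem_statement` plus the checkout at `base_commit` and must
# produce its own patch; applying `test_patch` and running the FAIL_TO_PASS /
# PASS_TO_PASS tests determines whether the issue counts as resolved.
```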
Easy-to-use, well-documented fine-tuning. NVIDIA-optimized, with AMD support and Apple M-series support in the works.
continually updating a model's parameters with new data often leads to “catastrophic forgetting” (CF), where learning new tasks sacrifices proficiency on old tasks. Researchers traditionally combat CF through architectural tweaks or better optimization rules. However, for too long, we have treated the model's architecture (the network structure) and the optimization algorithm (the training rule) as two separate things, which prevents us from achieving a truly unified, efficient learning system.
By defining an update frequency rate, i.e., how often each component's weights are adjusted, we can order these interconnected optimization problems into "levels." This ordered set forms the heart of the Nested Learning paradigm.
We observed that many standard optimizers rely on simple dot-product similarity (a measure of how alike two vectors are, calculated as the sum of the products of their corresponding components), whose update doesn't account for how different data samples relate to each other. By changing the underlying objective of the optimizer to a more standard loss metric, such as L2 regression loss (a common loss function in regression tasks that quantifies the error by summing the squares of the differences between predicted and true values), we derive new formulations for core concepts like momentum, making them more resilient to imperfect data.
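As a minimal sketch of the contrast being drawn, using a generic linear associative memory M with keys k_t and values v_t (the symbols and the delta-rule framing are mine, not quoted from the post): a dot-product similarity objective yields a Hebbian-style additive update that ignores what the memory already stores, while an L2 regression objective yields an error-correcting update that only writes back the residual, which is what makes it more robust to imperfect or repeated data.

```latex
% Dot-product similarity objective -> Hebbian-style (purely additive) update;
% it never checks what M already predicts for k_t:
\max_{M}\; \langle M k_t,\, v_t \rangle
\quad\Longrightarrow\quad
M_{t+1} = M_t + \eta\, v_t k_t^{\top}

% L2 regression objective -> delta-rule (error-correcting) update;
% only the residual (M_t k_t - v_t) is written into memory:
\min_{M}\; \tfrac{1}{2}\,\lVert M k_t - v_t \rVert_2^2
\quad\Longrightarrow\quad
M_{t+1} = M_t - \eta\, (M_t k_t - v_t)\, k_t^{\top}
```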
In a standard Transformer, the sequence model acts as a short-term memory, holding the immediate context, while the feedforward neural networks act as long-term memory, storing pre-training knowledge. The Nested Learning paradigm extends this concept into what we call a “continuum memory system” (CMS), where memory is seen as a spectrum of modules, each updating at a different, specific frequency rate. This creates a much richer and more effective memory system for continual learning.
"Nested Learning" extends the traditional two-tier memory concept of "attention layers" (short-term memory / context window) and "feed-forward network layers" (long term memory) into a spectrum of modules that update at different rates, some very frequently (like attention), some rarely (like FFNs), and others at various points in between.
MLPerf Client is a benchmark developed collaboratively at MLCommons to evaluate the performance of large language models (LLMs) and other AI workloads on personal computers, from laptops and desktops to workstations. By simulating real-world AI tasks, it provides clear metrics for understanding how well systems handle generative AI workloads. The MLPerf Client working group intends for this benchmark to drive innovation and foster competition, ensuring that PCs can meet the challenges of the AI-powered future.
We introduce the Berkeley Function Calling Leaderboard (BFCL), the first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' (LLMs) ability to invoke functions. Unlike previous evaluations, BFCL accounts for various forms of function calls, diverse scenarios, and executability.
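To make "invoking a function" concrete, here is a generic example of the kind of item such an evaluation checks: a function schema, a user request, and the call the model is expected to emit. The schema, names, and values below are my own illustration and are not taken from BFCL.

```python
# Hypothetical function-calling test case (illustrative only, not from BFCL).
function_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

user_prompt = "What's the weather in Berlin right now, in celsius?"

# The model is judged on whether it emits a well-formed call with the right
# function name and arguments:
expected_call = {"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}

# An *executable* evaluation goes further than string matching: it actually runs
# the call (or a stub of it) and checks the parsed arguments / returned result.
```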
In 2024, SWE-bench & SWE-agent helped kickstart the coding agent revolution.
We now ask: What if SWE-agent was 100x smaller, and still worked nearly as well?
mini is for:
- Researchers who want to benchmark, fine-tune, or RL without assumptions, bloat, or surprises
- Developers who like their tools like their scripts: short, sharp, and readable
- Engineers who want something trivial to sandbox and to deploy anywhere

Here are some details:
- Minimal: Just 100 lines of python (+100 total for env, model, script), with no fancy dependencies!
- Powerful: Resolves >74% of GitHub issues in the SWE-bench Verified benchmark (leaderboard).
- Convenient: Comes with UIs that turn this into your daily dev swiss army knife!
- Deployable: In addition to local envs, you can use docker, podman, singularity, apptainer, and more.
- Tested: Codecov.
- Cutting edge: Built by the Princeton & Stanford team behind SWE-bench and SWE-agent.
Rnj-1 is an 8B model that roughly follows the open-source Gemma 3 architecture. We employ global self-attention and YaRN to extend the context to 32k. The Rnj-1 Base and Instruct models compare favorably against similarly sized open weight models.
Rnj-1 Instruct dominates the pack on agentic coding, one of our target abilities. SWE-bench performance is indicative of the model's ability to tackle everyday software engineering tasks. We are an order of magnitude stronger than comparably sized models on SWE-bench and approach the capabilities available in much larger models (leaderboard: SWE-bench-Verified bash-only).
Some interesting subtle things he ever so briefly mentions that I think are notable:
Extracting Claude’s soul.
Today, we release INTELLECT-3, a 100B+ parameter Mixture-of-Experts model trained on our RL stack, achieving state-of-the-art performance for its size across math, code, science and reasoning benchmarks, outperforming many larger frontier models.
Our complete recipe, from the model weights and training frameworks to our datasets, RL environments, and evaluations, has been open-sourced, with the goal of encouraging more open research on large-scale reinforcement learning.
Mayo Clinic adopted a reverse RAG technique that effectively eliminated data retrieval hallucinations in their tests. In a traditional RAG setup, an LLM retrieves context from a knowledge source before generating an answer. Mayo's reverse RAG flips this process: the model first extracts or summarizes information, then links every data point in its output back to the source document. By forcing the AI to provide a reference for each fact, Mayo virtually eliminated hallucinations in non-diagnostic use cases, building clinician trust in the results.
The workflow looks like this (a rough code sketch follows the list):
- Data Extraction — The LLM/OCR/API reads the patient's records (e.g. discharge summaries or outside medical files) and produces a summary or list of facts. This initial output might include details such as patient age, diagnoses, lab results, etc.
- Fact Splitting — The AI output is split into individual facts or data points. Each sentence or key piece of information from the summary is treated separately.
- Source Matching — For each fact, the system searches the patient’s records (using a vector database of document embeddings) to locate the original source text that supports that fact. Essentially, the AI is asked: “Where did this piece of information come from?” Every fact must be matched to a snippet in the records (for example, the patient’s age is verified from the admission note, a lab value from the lab report, etc.).
- Verification — A second LLM then compares each fact to the retrieved source text and scores how well they align. It checks that the fact is truly supported by the source and not a misunderstanding or fabrication. Mayo’s team even looked for a causal relationship — ensuring the context implies that fact, not just a coincidental mention.
- Output with References — Only facts with solid support are kept. The final output is delivered with inline citations or links to the original records for every data point. This means physicians can click a link and see exactly where each piece of information came from, ensuring transparency and trust.
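Here is that rough code sketch of the loop. The helper objects (summary_model, verifier_model, vector_store) and the 0.8 threshold are hypothetical stand-ins; Mayo's actual pipeline and scoring are not public at this level of detail.

```python
# Hypothetical sketch of a "reverse RAG" verification loop; every helper name
# and the acceptance threshold are stand-ins, not Mayo's implementation.

def reverse_rag(patient_records: list[str], summary_model, verifier_model, vector_store):
    # 1. Data extraction: summarize the records into candidate facts.
    summary = summary_model.summarize(patient_records)

    # 2. Fact splitting: treat each sentence / key data point separately.
    facts = [s.strip() for s in summary.split(".") if s.strip()]

    verified = []
    for fact in facts:
        # 3. Source matching: find record snippets likely to support the fact.
        snippets = vector_store.search(fact, top_k=3)

        # 4. Verification: a second LLM scores how well each snippet supports the fact.
        scored = [(s, verifier_model.support_score(fact=fact, source=s)) for s in snippets]
        snippet, score = max(scored, key=lambda pair: pair[1], default=(None, 0.0))

        # 5. Output with references: keep only well-supported facts, with a citation.
        if snippet is not None and score >= 0.8:  # threshold is illustrative
            verified.append({"fact": fact, "source": snippet, "score": score})

    return verified
```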
LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models are susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL uses a notion of a capability to prevent the exfiltration of private data over unauthorized data flows by enforcing security policies when tools are called.
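In spirit, and as my own toy rendering rather than the paper's implementation: the program's control flow is derived from the trusted user query before any untrusted content is read, every value carries capability metadata, and a policy check runs before each tool call so data derived from untrusted sources cannot silently steer a sensitive action.

```python
# Toy illustration of the CaMeL idea (capability tags + fixed control flow).
# The classes, policy, and tool names are invented, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Value:
    data: str
    capabilities: set[str] = field(default_factory=set)  # e.g. {"untrusted"}

class PolicyError(Exception):
    pass

def check_policy(tool: str, kwargs: dict) -> None:
    # Example policy (mine, not the paper's): the email recipient must never be
    # derived from untrusted data, so injected instructions cannot redirect mail.
    if tool == "send_email" and "untrusted" in kwargs["to"].capabilities:
        raise PolicyError("send_email recipient was derived from untrusted data")

def call_tool(tool: str, **kwargs: Value) -> Value:
    check_policy(tool, kwargs)
    result = Value(data=f"<result of {tool}>")
    for v in kwargs.values():                  # capability propagation: outputs
        result.capabilities |= v.capabilities  # inherit their inputs' capabilities
    if tool == "fetch_document":
        result.capabilities.add("untrusted")   # retrieved content is untrusted
    return result

# The control flow below is fixed from the *trusted* query ("summarize the shared
# doc and email it to me") before any untrusted content is seen:
doc = call_tool("fetch_document", url=Value("https://example.com/shared-doc"))
summary = call_tool("summarize", text=doc)                         # tainted as untrusted
call_tool("send_email", to=Value("me@example.com"), body=summary)  # allowed: trusted recipient
# call_tool("send_email", to=summary, body=summary)  # would raise PolicyError
```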
Visit a Reddit post with Comet and ask it to summarize the thread, and malicious instructions in a post there can trick Comet into accessing web pages in another tab to extract the user's email address, then perform all sorts of actions like triggering an account recovery flow and grabbing the resulting code from a logged-in Gmail session.
Anthropic don't recommend autonomous mode, where the extension can act without human intervention. Their default configuration instead requires users to be much more hands-on.
A llama.cpp-based app for running local models.
A great open-source alternative that I used for running LLMs locally without having to use llama.cpp directly.