Getting Caught Up to Modern LLM Research
2025-01-15
Table of Contents
- Introduction
- Part 0: RNNs, Encoder-Decoder, and Attention
- Part 1: The Transformer
- Part 2: Past the Original Transformer
- Part 3: Training Methods and Pipelines
- Part 4: Inference and Training Optimization
- Part 5: Open-Source Advancements
- Part 6: Agents
- Conclusion
Introduction
Catching up to LLM research is hard. It’s too easy to take shortcuts from half-baked Twitter posts and find yourself “knowing” about LLMs, but not necessarily understanding them. There’s a time and place for both, but if you’re looking to get involved in AI research, specifically LLM research, it’s incredibly important to do things the right way, also known as the hard way.
I decided to start doing AI research around a year ago, so I’ve personally struggled with finding the right resources and the right path to best learn the topics needed to get a grasp of the LLM research landscape. There are countless topics, subtopics, niches, and ideas to explore, and I know that a curated guide introducing me to the field would have been extremely valuable. Well, here it is: the blog post I wish I’d had when starting my own journey.
Preamble
There are two parts to this collection, catered towards two audiences.
Audience one has a solid grasp of PyTorch and an at least surface-level understanding of deep neural networks up until CNNs and RNNs.
Audience two is more focused on learning LLM-specific topics and is willing to skip the nitty-gritty implementation details in the process.
If you have the background for it, meaning you have some programming experience and have taken some sort of intro to ML/AI course in the past, I highly recommend you go through the entirety of the resources listed in this post, as honing your PyTorch and programming skills alongside a theoretical understanding of AI will have the highest payoff. If not or if you’re looking to move faster, no worries. Feel free to skip ahead to Part 1.
Disclaimer: The ordering and presentation of resources on this page are extremely biased to what I found worked for me. If you prefer learning another way, be aware of potential differences.
Technical Prerequisites
If you are already proficient in PyTorch, don’t want to program, or want to skip forward to learning about LLMs, feel free to move ahead to Part 1.
If you don’t have any experience with Neural Networks, 3Blue1Brown’s Neural Networks playlist is a good place to start gaining some intuition around basic deep learning.
If you have some programming experience but have no experience with Neural Networks, I recommend going through the first 4-5 videos of Andrej Karpathy’s Neural Networks: Zero to Hero course.
This is more optional, but if you’re not familiar with CNNs and RNNs and want to learn more, consider taking a look at the first half of UC Berkeley’s CS 182: Deep Neural Networks or the first third of Stanford’s CS 224N: Natural Language Processing with Deep Learning.
If you have no familiarity with PyTorch but have some programming experience, I highly recommend going through this course on PyTorch. LLM research is heavily biased towards engineering skills and applied science, so having confidence in your PyTorch skills has an extremely high payoff.
But what is a neural network?
3Blue1Brown's Neural Networks playlist
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
Neural Networks: Zero to Hero
Andrej Karpathy's Neural Networks: Zero to Hero course
https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
Intro to PyTorch
Make sure you can replicate the lessons from scratch before continuing!
https://github.com/mrdbourke/pytorch-deep-learning
Part 0: RNNs, Encoder-Decoder, and Attention
If you’re already familiar with RNNs, attention mechanisms, and encoder-decoder architectures, or if you want to dive straight into LLMs, feel free to skip this section. However, if you’re new to these concepts, the following resources will provide a solid foundation for understanding future work in transformers and LLMs.
Seq2Seq Implementation with RNNs
To grasp the basics of RNNs and how they’re implemented in PyTorch, I recommend working through the following two tutorials.
These tutorials will guide you through implementing both a classifier and a generative model using an RNN backbone. I highly recommend being able to recreate the code for each tutorial from scratch, including the training loop and data loading, without referring to the tutorial.
Classification with RNNs
NLP From Scratch: Classifying Names with a Character-Level RNN — PyTorch Tutorials
https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
Generation with RNNs
NLP From Scratch: Generating Names with a Character-Level RNN — PyTorch Tutorials
https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html
Encoder-Decoder Architecture
Once you have a solid grasp of RNNs, the next step is to understand the shift towards encoder-decoder architectures. These architectures first emerged in the context of RNNs before being further popularized by transformers. Look over the following papers to understand the basics of the encoder-decoder architecture.
Encoder-Decoder Architecture using RNNs
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
https://arxiv.org/abs/1406.1078
Encoder-Decoder Models for Seq2Seq Tasks
Sequence to Sequence Learning with Neural Networks
https://arxiv.org/abs/1409.3215
The Attention Mechanism
The next building block towards modern LLMs is a solid understanding of the attention mechanism, one of the key ingredients powering the performance we see in modern LLMs. Attention was first popularized within the NLP community for use with RNNs before being adapted for transformers. The initial transformer paper settled on a version of scaled dot-product attention, but it’s worth exploring the evolution of attention mechanisms to gain a deeper understanding. Interestingly enough, a few notable individuals in the deep learning community called out attention as a step-function improvement in NLP capabilities, though at the time it was too early to tell just how important it would become.
One of the first well-known versions of the attention function was called Bahdanau attention, or additive attention. Although it wasn’t the final attention function used in LLMs, the original paper provides valuable insights into the intuition behind attention. The paper is well written, so I highly suggest implementing the model in the paper from scratch to get a feel for the attention mechanism outside of the context of transformers.
Additive Attention in RNNs
Neural Machine Translation by Jointly Learning to Align and Translate
https://arxiv.org/abs/1409.0473
This blog post may also be helpful as an illustration of the attention mechanism:
The Illustrated Attention Mechanism
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
After understanding attention, go through part 3 of the PyTorch RNN tutorials to see a full implementation of an encoder-decoder attention-based RNN architecture and train it end-to-end.
Attention-based Seq2Seq Translation
NLP From Scratch: Translation with a Sequence to Sequence Network and Attention — PyTorch Tutorials
https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
Extra Resources
This blog post is mainly about LLM research, so I’ll stop here with pre-transformer topics. If you’re interested, however, here are a few resources I’d recommend as extra reading.
- https://karpathy.github.io/2015/05/21/rnn-effectiveness/
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Part 1: The Transformer
One of the most cited academic papers in existence, and still the best starting point for learning about transformers, is the OG Attention is All You Need paper.
Attention Is All You Need
The paper that kicked off the LLM and Transformer model revolution
https://arxiv.org/abs/1706.03762
This paper will likely be difficult to fully understand unless you attempt to implement the transformer yourself in PyTorch, from scratch. There are many tutorials for this, but start by trying to recreate elements of its architecture based solely on the paper. Try not to use references other than the earlier RNN tutorials, so you can appreciate the shift from RNNs to transformers and the implementation differences. Looking at the shapes of all the input/output tensors can be extremely helpful for understanding what’s going on internally.
Again, Jay Alammar provides a well-known blog post with illustrations to demonstrate the internals of transformers:
The Illustrated Transformer
Another great illustrative blog post by Jay Alammar, this time on the Transformer architecture
https://jalammar.github.io/illustrated-transformer/
After implementing the transformer yourself, you can try recreating it using an implementation from Harvard’s NLP group. I don’t agree with a lot of their software design choices, but it never hurts to see another implementation. You can also use the code as a reference in case you get stuck on implementing your version of the transformer.
The Annotated Transformer
A rehash of 'Attention is all You Need', but with annotations for PyTorch implementations of various blocks introduced in the paper
https://nlp.seas.harvard.edu/annotated-transformer/
Take your time on this part. Deeply understanding each piece of the original transformer will lay the groundwork for a better understanding of modern LLM research, and it’s well worth the time. Today’s LLMs look strikingly similar to the architecture in the original paper, despite the 8+ years that have passed since its publication.
Extra Resources
Implementation Details
While it’s important to be able to code a transformer, its data loading, and its training loop from scratch, you’ll rarely use a naive self-written implementation in practice. PyTorch offers modules for each component of a transformer, all the way up to an entire Transformer module, and these come with many useful optimizations under the hood. To internalize them, I recommend making a copy of your original transformer implementation and gradually substituting in each piece from the PyTorch implementations, measuring the speedups as you go: start with the Encoder/Decoder layers, then the Encoder/Decoder stacks, and finally the full Transformer module (a minimal sketch of this progression follows the links below).
TransformerEncoderLayer PyTorch Module
TransformerEncoderLayer — PyTorch 2.1 documentation
https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html
TransformerDecoderLayer PyTorch Module
TransformerDecoderLayer — PyTorch 2.1 documentation
https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoderLayer.html
TransformerEncoder PyTorch Module
TransformerEncoder — PyTorch 2.1 documentation
https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html
TransformerDecoder PyTorch Module
TransformerDecoder — PyTorch 2.1 documentation
https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoder.html
Transformer PyTorch Module
Transformer — PyTorch 2.1 documentation
https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html
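To make the substitution concrete, here’s a minimal sketch of how the built-in modules compose, from individual layers up to the full nn.Transformer. The hyperparameters and random tensors are purely illustrative.

```python
import torch
import torch.nn as nn

# Composing PyTorch's built-in transformer modules on illustrative random inputs.
d_model, nhead, num_layers = 512, 8, 6

enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
decoder = nn.TransformerDecoder(dec_layer, num_layers=num_layers)

src = torch.randn(2, 10, d_model)  # (batch, src_len, d_model), already embedded
tgt = torch.randn(2, 7, d_model)   # (batch, tgt_len, d_model), already embedded

memory = encoder(src)                                         # (2, 10, d_model)
causal_mask = nn.Transformer.generate_square_subsequent_mask(7)
out = decoder(tgt, memory, tgt_mask=causal_mask)              # (2, 7, d_model)

# Or swap in the full encoder-decoder stack as a single module:
model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=num_layers, num_decoder_layers=num_layers,
                       batch_first=True)
out2 = model(src, tgt, tgt_mask=causal_mask)                  # (2, 7, d_model)
```

Note that these modules expect already-embedded inputs; the token embedding, positional encoding, and final projection back to the vocabulary are still up to you.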
Part 2: Past the Original Transformer
While the original transformer was a groundbreaking work, it took a while before it gained widespread adoption in the NLP community, and even longer until it became popularized outside academia (ChatGPT came out 5 years afterward!). Along the way, there were many attempts to increase the power of the transformer by modifying its architecture, training scheme, usage, and more! It’s extremely important to understand how we bridged the gap between the transformer released in 2017 and the deluge of capability-focused papers starting in 2020, and there are a few works commonly referenced when discussing this transitional period.
Make sure you familiarize yourself with the patterns in these works, as you’ll notice themes that come up again and again in modern research and are key to understanding the current landscape. I’ll also include key benchmark papers to help you interpret the results reported in these works. Feel free to skim the benchmark papers; a surface-level understanding of each eval dataset is enough to get a gist of where these models were getting good and what academics were focused on improving.
Early Derivatives
While the original transformer used an encoder-decoder architecture to model its downstream task of translation, it became apparent that separating these two aspects (the encoder and the decoder) opened up a world of possibilities for what a transformer layer could be used for, and greatly widened the scope of tasks that the transformer architecture was applied to.
While some later works attempted to unify these tasks into a larger encoder-decoder architecture instead of separating them, today’s model landscape treats them separately. Embedding models, for example, are typically used for semantic understanding, classification, and information retrieval. Decoder models, on the other hand, are the main generators in language modeling, producing novel text, and are sometimes just called Language Models (LMs). Read the following papers to understand the distinction, its origin, and why this split has endured in today’s well-known models.
First, we’ll introduce BERT and its successor, RoBERTa. This was the first family of popular text embedding models to use the transformer architecture, and BERT is an extremely foundational paper for modern NLP applications. RoBERTa is a great paper because the authors just juiced up BERT with more data and compute and got better performance than many other works claiming to introduce novel techniques on top of BERT. You’ll see that this is a common theme in modern AI research.
We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. These results […] raise questions about the source of recently reported improvements.
— RoBERTa paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT Paper
https://arxiv.org/abs/1810.04805
RoBERTa: A Robustly Optimized BERT Pretraining Approach
RoBERTa Paper
https://arxiv.org/abs/1907.11692
Next is the GPT, or the Generative Pre-trained Transformer. This is the first version of what eventually became ChatGPT / GPT-4, and is certainly an interesting read to learn about the origin of the models we use today.
Improving Language Understanding by Generative Pre-Training
Generative Pre-trained Transformer (GPT 1) Paper
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Advancing Encoder-Decoder Models
Even with the advances found by splitting the original transformer into encoder-only and decoder-only models, later attempts tried to improve performance by reuniting the two halves of the original model. Two well-known models came out of these efforts and are still sometimes used today as open-source baselines: BART and T5. It’s interesting to continue reading through these earlier works to see the evolution of ideas from the original transformer, to BERT and GPT, to T5 and BART, and then back to GPT and BERT variants.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
BART Paper
https://arxiv.org/abs/1910.13461
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 Paper
https://arxiv.org/abs/1910.10683
A quick add-on - Transformer-XL is an interesting transformer variant that introduced recurrence and relative positional embeddings into the transformer literature. While it was never truly SOTA, it’s an important read that introduces many ideas still relevant today.
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Transformer-XL paper
https://arxiv.org/abs/1901.02860
Decoding Strategies
While not tied to any specific transformer variant or LLM, understanding how tokens are actually generated from the transformer’s probabilistic output is a fundamental building block. This transition from a probability distribution to output token(s) is called decoding, and it can be done in a variety of ways. I recommend taking a minute to read this high-level overview of the most common decoding strategies modern models implement (a small sampling sketch follows the link below).
Decoding Strategies that You Need to Know for Response Generation
Blog post on decoding strategies
https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc
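To ground the ideas from the post above, here’s a toy sketch of greedy decoding, temperature sampling, and top-k sampling over a single vector of next-token logits. The logits are random and the vocabulary size is tiny, purely for illustration.

```python
import torch
import torch.nn.functional as F

# Pretend these are the next-token logits produced by a language model.
logits = torch.randn(10)

# Greedy decoding: always pick the highest-probability token.
greedy_token = torch.argmax(logits).item()

# Temperature sampling: divide logits by a temperature before sampling.
# T < 1 sharpens the distribution, T > 1 flattens it.
temperature = 0.7
probs = F.softmax(logits / temperature, dim=-1)
sampled_token = torch.multinomial(probs, num_samples=1).item()

# Top-k sampling: keep only the k most likely tokens, renormalize, then sample.
k = 5
topk_vals, topk_idx = torch.topk(logits, k)
topk_probs = F.softmax(topk_vals, dim=-1)
topk_token = topk_idx[torch.multinomial(topk_probs, num_samples=1)].item()

print(greedy_token, sampled_token, topk_token)
```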
Scaling up the Generative Pre-trained Transformer
Obligatory inclusion of GPT-2 and GPT-3 papers.
Honestly, I don’t think the GPT-2 and GPT-3 papers are particularly informative, but I’m including them for the sake of completeness more than anything.
Language Models are Unsupervised Multitask Learners
GPT-2 Paper
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Language Models are Few-Shot Learners
GPT-3 Paper
https://arxiv.org/abs/2005.14165
Scaling Up [in General]
A lot of work in 2021-2022 focused on scaling up parameter counts. Earlier papers had shown power-law relationships between parameter count and transformer capabilities, but the papers below demonstrate truly massive scale (surprise, they’re from Google!) and discuss these scaling laws in more detail. The first is PaLM, Google’s first foray into publicly announced LLMs (predating Bard, which predated Gemini).
PaLM: Scaling Language Modeling with Pathways
PaLM Paper
https://arxiv.org/abs/2204.02311
Another method used to scale parameter counts is the Mixture of Experts (MoE) model, which has been hypothesized to be the architecture behind the original GPT-4. While the specific model below is fairly insignificant today, the ideas introduced in the paper are a good intro to MoE models (a toy routing sketch follows it).
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers (MoE) Paper
https://arxiv.org/abs/2101.03961
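As a rough illustration of the routing idea (not the Switch Transformer implementation itself), here’s a toy top-1 MoE feed-forward layer. The sizes are arbitrary, and real implementations add load-balancing losses and expert capacity limits that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """A toy top-1 Mixture-of-Experts feed-forward layer (Switch-style routing)."""
    def __init__(self, d_model=64, d_ff=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # produces routing logits per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                              # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)       # (num_tokens, num_experts)
        weight, expert_idx = gate.max(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                     # tokens routed to expert i
            if mask.any():
                out[mask] = weight[mask].unsqueeze(1) * expert(x[mask])
        return out

tokens = torch.randn(32, 64)                           # 32 tokens, d_model = 64
print(ToyMoELayer()(tokens).shape)                     # torch.Size([32, 64])
```

The appeal is that parameter count grows with the number of experts while the compute per token stays roughly constant, since each token only visits one (or a few) experts.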
The final paper of this section is very well-known and is often referenced when debating modern scaling laws and the idea that we’ve run out of data to train new models. In the Chinchilla paper, DeepMind shows that earlier power-law results relating parameter count to performance were incomplete, and that adding dataset size as an axis helps fill the gap. Many follow-up papers from DeepMind used the Chinchilla family as base models to test new LLM techniques for 1-2 years, making these models relatively long-lived in their usage.
Training Compute-Optimal Large Language Models
Chinchilla Paper
https://arxiv.org/abs/2203.15556
Key Benchmarks:
These benchmarks are useful to take a look at since many of these are still used today.
- [Important] MMLU: Measuring Massive Multitask Language Understanding
- BIG-Bench: Beyond the Imitation Game Benchmark
- WinoGrande
- HellaSwag
- BoolQ
- GSM8K: Grade School Math 8K
- [Important] MATH: Measuring Mathematical Problem Solving
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark
The Rise of the Embedding Model
Embedding models have been instrumental in the rise of LLMs and agents, and are the unsung heroes of the AI revolution. Why? It turns out that the simple concept of representing any amount of text as a vector is extremely powerful, and it powered Google’s recommendation systems, information retrieval, classification, and more for years before LLMs came onto the scene. Transformer-based embedding models became the new norm after “Attention is All You Need”, but many of the core ideas followed their own evolutionary path, slightly disjoint from the research paths taken for generative models. How did we get here? Read the original Word2Vec paper to find out.
Efficient Estimation of Word Representations in Vector Space
Word2Vec paper
https://arxiv.org/abs/1301.3781
Now we can transform a word into a vector, but how do we go even further and transform multiple words, sentences, and even paragraphs into vectors? After all, embedding entire essays into an N by M block of numbers doesn’t seem like the most efficient way to go…
This is where sentence-BERT came in, taking the ideas we saw earlier in BERT and making them much more usable by enabling us to embed any amount of text into a single vector.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT paper
https://arxiv.org/abs/1908.10084
Sentence embeddings are cool, but now we want to learn how to use them with LLMs! The answer is Retrieval-Augmented Generation (RAG), a technique commonly used today to provide LLMs with whatever external information they need to fully respond to a query. When you use an LLM with web search, over a database, or over your files, RAG is happening in the background to give the LLM the context it needs to answer your query. Below is the original RAG paper, which, while a little verbose, is still a good read for those interested in the art of RAG (a minimal retrieval sketch follows it).
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Retrieval-Augmented Generation (RAG) introductory paper
https://arxiv.org/abs/2005.11401
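To make the retrieval half of RAG concrete, here’s a minimal sketch. The embed function is a stand-in for a real embedding model (here it’s a fake hash-based embedding just so the snippet runs end to end), and the prompt format is my own assumption rather than anything from the paper.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder embedding: deterministic random unit vector per string.
    # Swap in a real sentence embedding model for meaningful retrieval.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

documents = [
    "The Eiffel Tower is located in Paris.",
    "Transformers use self-attention to process sequences.",
    "Retrieval-augmented generation grounds LLM answers in external documents.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, top_k: int = 2) -> list[str]:
    q = embed(query)
    scores = doc_vectors @ q                  # cosine similarity (vectors are unit-norm)
    best = np.argsort(-scores)[:top_k]
    return [documents[i] for i in best]

question = "How does RAG help language models?"
context = "\n".join(retrieve(question))
prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# `prompt` would then be sent to the LLM of your choice.
print(prompt)
```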
Next, I’ll introduce the concept of re-ranking via a paper dubbed Lost in the Middle, which explores how LLMs read the contents of their context window and presents some extremely interesting and relevant findings for how we should place retrieved context within an LLM’s input.
Lost in the Middle: How Language Models Use Long Contexts
Lost in the Middle effect and re-ranking motivating paper
https://arxiv.org/abs/2307.03172
Finally, a cool paper showing how decoder-based models that were trained to generate text can actually be adapted into powerful embedding models. This is a relatively new idea but has been gaining popularity, with real performance implications for modern uses of embedding models.
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
LLM2Vec Paper
https://arxiv.org/abs/2404.05961
As a quick bonus, I’ll throw in a research paper I wrote on embedding models! My co-author and I show that today’s widely available embedding models display clear positional bias, meaning that they prioritize and overweight content at the beginning of their input compared to content at the end. This has implications for chunking, information retrieval, and more, where this inductive bias may not be welcome from a performance standpoint.
Quantifying Positional Biases in Text Embedding Models
Embedding models display clear positional biases, with implications for chunking, information retrieval, and more
https://arxiv.org/abs/2412.15241
Applying Transformers to Vision Tasks
Surprisingly, the same architecture that powers LLMs also does extremely well on tasks that take some sort of visual input (photos, video). While these models don’t always match dedicated Computer Vision (CV) models, combining the language capabilities of transformers with visual input is extremely powerful for tasks such as captioning and text-to-image generation.
A useful paper to start with here is the Vision Transformer. This paper is fairly straightforward to implement on top of a regular transformer, so I would encourage attempting to create this model along with its data pipeline after a first pass.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Vision Transformer Paper
https://arxiv.org/abs/2010.11929
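As a quick illustration of the paper’s core trick, here’s a sketch of the patch-embedding step. The dimensions follow the common ViT-Base configuration, but everything else about the model (the encoder stack, the classification head) is omitted.

```python
import torch
import torch.nn as nn

# A Conv2d with kernel size and stride equal to the patch size is equivalent to
# slicing the image into non-overlapping 16x16 patches and linearly projecting each.
patch_size, d_model = 16, 768
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)

images = torch.randn(2, 3, 224, 224)             # (batch, channels, height, width)
patches = patch_embed(images)                    # (2, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)      # (2, 196, 768): a sequence of patch tokens

# Prepend a learnable [CLS] token; positional embeddings are added before the
# sequence goes through a standard transformer encoder.
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], dim=1)   # (2, 197, 768)
print(tokens.shape)
```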
One of the most notable works in this vein is CLIP, which arguably kick-started the idea of using transformers to combine both visual and text-based inputs. CLIP is still used today as a reference and most visual LLMs (VLLMs) are built on top of ideas introduced in the CLIP paper.
Learning Transferable Visual Models From Natural Language Supervision
CLIP Paper
https://arxiv.org/abs/2103.00020
To round this section out with some modern research, we’ll mention LLaVA, which was one of the first open-source VLLMs fine-tuned for instruction following.
Visual Instruction Tuning
LLaVA Paper
https://arxiv.org/abs/2304.08485
Finally, we mention the DALL-E series of models, which make up OpenAI’s image generation offering within ChatGPT.
Zero-Shot Text-to-Image Generation
DALL-E 1 Paper
https://arxiv.org/abs/2102.12092
Hierarchical Text-Conditional Image Generation with CLIP Latents
DALL-E 2 Paper
https://arxiv.org/abs/2204.06125
Improving Image Generation with Better Captions
DALL-E 3 Paper
https://cdn.openai.com/papers/dall-e-3.pdf
Applications and Honorable Mentions
The following papers are important when applying LLMs to real-world use cases but don’t neatly fit into any of the earlier categories or any specific storyline.
The first is Codex, the first major model trained on code. As you can imagine from the proliferation of LLMs in coding, the ideas presented in this paper are extremely relevant and laid the foundation for many of the ideas around code-based training used today.
Evaluating Large Language Models Trained on Code
Codex paper
https://arxiv.org/abs/2107.03374
The second is the Chain-of-Thought paper. This paper is relatively straightforward, and just taking a look at the abstract, the figures, and the introduction is enough to get the gist. TL;DR: telling a model to explain its reasoning tends to result in better performance. Why? We’re still figuring that out.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought paper
https://arxiv.org/abs/2201.11903
Finally, there’s the paper for Whisper, OpenAI’s voice-to-text model. Whisper was a step-function improvement over anything else when it came out and continues to be the base for SOTA voice-to-text models.
Robust Speech Recognition via Large-Scale Weak Supervision
Whisper paper
https://arxiv.org/abs/2212.04356
Alternative Architectures
While transformers are still the de-facto “winning” architecture for LLMs and their applications, there are some attempts to take a new class of models dubbed state-space models and bring them into the limelight. Notably, Cartesia AI is leading these efforts, founded by the creators of some of the most popular state-space models such as S4 and Mamba. While these are interesting to learn about, it’s still unclear whether they will displace transformers as the dominant architecture. That said, these models excel at extremely long-context tasks and have a few specific use cases where they perform better than the bigger transformer models of today.
Efficiently Modeling Long Sequences with Structured State Spaces
S4 state space model paper
https://arxiv.org/abs/2111.00396
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba state space model paper
https://arxiv.org/abs/2312.00752
Extra Resources
Learning about these powerful models is important, but it’s also helpful to know how to go about using these in your code for real-world tasks. Many of these models have publicly available code and weights, most of which are hosted on HuggingFace. HuggingFace is the go-to repository for accessing and interacting with ML/AI datasets and models, and it’s well worth spending some time to create an account and to try poking around to familiarize yourself with the UI.
Follow me on HuggingFace: https://huggingface.co/sgoel9
The easiest way to use a modern LLM today is HuggingFace’s transformers library. They have a great course covering the basics of the library. It’s not too long, so I would recommend going through it to learn how to load and generate output from any of the open-source models we’ve covered (a minimal example follows the link below).
Introduction - Hugging Face NLP Course
HuggingFace's intro course on NLP
https://huggingface.co/learn/nlp-course/chapter1/1
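As a minimal example of what the course covers, here’s a sketch of loading an open model and generating text with transformers. The model name is just a small example to swap for whichever open model you want to try (larger models need a GPU and more memory).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example model; substitute any open causal LM from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The transformer architecture is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```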
Part 3: Training Methods and Pipelines
There are three main steps in the training pipeline for modern LLMs such as GPT-4 and Claude:
- Pretraining: Unsupervised learning on next-word prediction. Throwing internet-scale datasets at LLMs to get them started and give them a solid understanding of natural language.
- Supervised Fine-Tuning: Similar to pretraining, we train the LLM on next-word prediction, but on texts closer to the output we want to see from the LLM. This typically involves curated, higher-quality data such as news articles, guides, and so on, of much better quality than what you see in pretraining (both stages use the same next-token objective; see the loss sketch after this list).
- Alignment: This is a pretty broad term. Typically, RL is used to align the LLM’s output with human values, with following instructions rather than just predicting the next word, or with whatever else you want to “align” the AI to. Recent research tries to accomplish this without RL.
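To make the first two stages concrete, here’s a toy sketch of the next-token cross-entropy loss they both optimize. The difference between pretraining and SFT lies mainly in the data (and SFT often masks out the prompt tokens, which is skipped here); the logits below are random placeholders for a real model’s output.

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 4, 128, 32000
logits = torch.randn(batch, seq_len, vocab)            # stand-in for model(input_ids)
input_ids = torch.randint(0, vocab, (batch, seq_len))  # stand-in for tokenized text

# Shift so that position t is scored against the token at position t + 1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab)
shift_labels = input_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.item())
```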
You can find an amazing overview of this pipeline in Chip Huyen’s blog, which I highly recommend.
RLHF: Reinforcement Learning from Human Feedback
Chip Huyen's blog post on RLHF
https://huyenchip.com/2023/05/02/rlhf.html
Pretraining
There aren’t too many works related to pretraining, since most of the work at this stage comes down to large-scale web scraping and data quality. A nice overview of the type of thinking required here can be found in the FineWeb paper by HuggingFace.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
The FineWeb Dataset paper
https://arxiv.org/abs/2406.17557
Supervised Fine-Tuning
A great example of the use cases for Supervised Fine-Tuning (SFT) can be seen in the InstructGPT paper. Here, researchers fine-tuned GPT-3 on an instruction-following dataset, aligning the model’s responses with the expected output from a helpful assistant, rather than a next token generator, which is what an LLM would be right after the pretraining stage.
Training language models to follow instructions with human feedback
InstructGPT Paper
https://arxiv.org/abs/2203.02155
Alignment
Lastly, we’ll look at a few papers and works around alignment, which is one of the most varied parts of the LLM training pipeline. There are a few popular methods to induce alignment, but the most well-known, and what led to ChatGPT’s breakthrough performance, is Reinforcement Learning from Human Feedback (RLHF). Read Chip Huyen’s article on it if you haven’t yet, to understand the process of training a model using RLHF.
RLHF: Reinforcement Learning from Human Feedback
Chip Huyen's blog post on RLHF
https://huyenchip.com/2023/05/02/rlhf.html
I consider this next paper somewhat optional, but a lab out of Stanford used some math to show that one can mimic RLHF without an explicit reward model, making training faster and more stable. It’s still debated whether Direct Preference Optimization (DPO) actually leads to better results than RLHF, but it has become a somewhat canonical part of the literature (a sketch of the loss follows the paper below).
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Direct Preference Optimization paper
https://arxiv.org/abs/2305.18290
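For intuition, here’s a sketch of the DPO objective as I read it from the paper. The log-probabilities are toy numbers; a real implementation computes them from the policy being trained and a frozen reference model (usually the SFT checkpoint).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Each argument is the summed log-probability a model assigns to a
    (prompt, response) pair for a batch of preference pairs."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to prefer the chosen response more than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy log-probabilities for a batch of 3 preference pairs, just for illustration.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.0]), torch.tensor([-14.0, -11.0, -19.0]),
                torch.tensor([-13.0, -10.0, -21.0]), torch.tensor([-13.5, -10.5, -20.0]))
print(loss.item())
```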
Another interesting development is the rise of Reinforcement Learning from AI Feedback (RLAIF), which, as it sounds, is just RLHF with the human replaced by a sufficiently powerful AI model such as GPT-4. The idea rose in popularity after the following paper came out, demonstrating that using GPT-4 as a judge to mimic a human in LMSYS’s Chatbot Arena produced judgments that closely matched human preferences. The broader idea of using an LLM as a decision-maker in place of a human has come to be known as LLM-as-a-Judge.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
LLM-as-a-Judge and Chatbot Arena paper
https://arxiv.org/abs/2306.05685
Finally, we’ll touch on Anthropic’s Constitutional AI paper, where their approach to alignment can be described as something along the lines of a massive system prompt telling their model to not be evil. An interesting read for sure.
Constitutional AI: Harmlessness from AI Feedback
Anthropic paper on Constitutional AI
https://arxiv.org/abs/2212.08073
Positional Encoding
Finally, it’s important to have some understanding of positional encoding methods, as these are ever-present throughout the model training pipeline and have evolved alongside LLMs themselves.
Positional encoding is the science behind how an LLM derives an ordering from its input. The attention operation at the core of a transformer has no inherent notion of token order, so we must inject position information into the input so that the model has some way to know that, in the phrase “the dog is brown”, the word “dog” comes before the word “brown”.
The original Attention is All You Need paper used sinusoidal positional encoding, but this was quickly shown to be suboptimal compared to alternatives. For a nice overview of positional encoding, read the relevant section of the following post by Lilian Weng (a sinusoidal-encoding sketch follows the link below). Feel free to read the entire post too; it’s extremely well written and informative, though quite dense.
The Transformer Family Version 2.0
Lilian Weng's post on Transformers, but with a helpful section on positional encoding
https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/#positional-encoding
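For reference, here’s a sketch of the original sinusoidal encoding from Attention is All You Need; the implementation style mirrors common PyTorch examples rather than any official code.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Added to the token embeddings before the first transformer block.
embeddings = torch.randn(1, 50, 512)                  # (batch, seq_len, d_model)
embeddings = embeddings + sinusoidal_positional_encoding(50, 512)
```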
While the following two papers were mentioned in the post above, they are still widely used when training LLMs today, so it’s worth reading the papers themselves to get a stronger understanding of the intuition behind how today’s models are trained.
RoFormer: Enhanced Transformer with Rotary Position Embedding
Rotary Positional Encoding paper
https://arxiv.org/abs/2104.09864
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
ALiBi Positional Encoding paper
https://arxiv.org/abs/2108.12409
Part 4: Inference and Training Optimization
While the transformer was a big improvement over its predecessor, the RNN, due in large part to its efficiency and parallelizable nature, the quadratic scaling of attention with respect to input length is still a bottleneck that many works have tried to improve upon. LLM inference-time and training-time optimization has taken a ton of directions, with the most notable being non-global attention, multi-head attention optimization, quantization, efficient fine-tuning, distillation, and more.
Non-Global Attention
While the original transformer used a global attention mechanism in which every token attends to every other token, there has been a decent amount of work on relaxing this requirement while retaining model performance. The following three papers represent the early works in this vein and are worth reading to solidify your understanding of attention, multi-head attention, and context utilization. It’s unknown exactly what attention patterns today’s closed models use, but many of the best open-source models utilize variants of the ideas presented here (a simple sliding-window mask sketch follows the papers below).
Generating Long Sequences with Sparse Transformers
Sparse Transformer Paper
https://arxiv.org/abs/1904.10509
Longformer: The Long-Document Transformer
Longformer Paper
https://arxiv.org/abs/2004.05150
Big Bird: Transformers for Longer Sequences
Big Bird Paper
https://arxiv.org/abs/2007.14062
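As a tiny illustration of the simplest non-global pattern, here’s a sketch of a causal sliding-window attention mask in the spirit of Longformer’s local attention; the window size and shapes are arbitrary, and the papers above add further patterns (dilated windows, global tokens, random blocks) on top of this.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Token i may only attend to tokens j with i - window < j <= i.
    Returns a boolean mask where True marks allowed positions."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=8, window=3)
print(mask.int())

# Scores outside the window are set to -inf before the softmax, so each token
# only attends over `window` positions instead of the full sequence.
scores = torch.randn(8, 8).masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```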
Multi-Head Attention Optimization
A small detour, and once again a technique used by many of the best open-source models today, is multi-query attention and its successor, grouped-query attention. While non-global attention techniques focus on shrinking the model’s attention map, these works reduce the cost of multi-head attention within a transformer block by sharing key and value heads across groups of query heads, which shrinks both the parameter count and, more importantly, the KV-cache (a small sketch follows the papers below).
Fast Transformer Decoding: One Write-Head is All You Need
Multi-Query Attention paper
https://arxiv.org/abs/1911.02150
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Grouped Query Attention paper
https://arxiv.org/abs/2305.13245
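Here’s a rough sketch of the grouped-query idea: store only a few key/value heads and share each across a group of query heads. The shapes are arbitrary, and real implementations fold this into the attention projections rather than materializing repeated tensors.

```python
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 64
num_q_heads, num_kv_heads = 8, 2       # 1 KV head = multi-query; 8 = standard multi-head
group_size = num_q_heads // num_kv_heads

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)   # only 2 K heads stored/cached
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)   # only 2 V heads stored/cached

# Repeat each K/V head so that every group of 4 query heads shares one K/V head.
k = k.repeat_interleave(group_size, dim=1)                 # (batch, 8, seq_len, head_dim)
v = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v)               # (batch, 8, seq_len, head_dim)
print(out.shape)
```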
Distillation
Distillation was born out of a pretty simple idea: what if you trained a smaller model on the outputs of a bigger model? If the bigger model’s performance is better, maybe you could replicate, or at least get close to, its performance with a smaller, cheaper model by having it “mimic” the larger model.
It turns out this works pretty well in practice. Distillation was popularized in the transformer world by a paper from HuggingFace, where they distilled BERT and achieved roughly similar performance with a much smaller model (a sketch of the loss follows the paper below).
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
HuggingFace paper on DistilBERT
https://arxiv.org/abs/1910.01108
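For intuition, here’s a sketch of the classic soft-label distillation loss, in the spirit of DistilBERT’s KL term (their full recipe adds more components). The logits and labels are random placeholders for a real teacher/student pair.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """KL term pulling the student's distribution toward the teacher's softened
    distribution, mixed with the usual cross-entropy on the true labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 4 examples, 10 classes (or vocabulary entries), just for illustration.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10), torch.randint(0, 10, (4,)))
print(loss.item())
```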
Distillation works for both embedding models and LLMs, with distillation from bigger-to-smaller LLMs having dramatic implications for cost-performance ratios. The next paper will discuss this process in LLMs and is a bit more up-to-date on modern distillation than the original HuggingFace paper.
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
High-level paper on LLM distillation
https://arxiv.org/abs/2305.02301
The KV-Cache
While not as academic a topic, the KV-cache is extremely important to making modern transformers work well. There unfortunately isn’t a canonical academic paper introducing it, but there are many blog posts to help you get the gist of this powerful inference-time optimization (a toy sketch follows the link below).
Transformers KV Caching Explained
KV-caching is a powerful inference-time optimization technique for transformers
https://medium.com/@joaolages/kv-caching-explained-276520203249
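Here’s a toy single-head sketch of the idea, with random projection matrices standing in for a trained model: each decoding step computes keys and values only for the newest token and appends them to the cache, instead of recomputing them for the whole prefix.

```python
import torch
import torch.nn.functional as F

d_model = 64
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(new_token_hidden):                 # (1, d_model): only the newest token
    q = new_token_hidden @ w_q
    k_cache.append(new_token_hidden @ w_k)         # cache grows by one entry per step
    v_cache.append(new_token_hidden @ w_v)
    K = torch.cat(k_cache, dim=0)                  # (steps_so_far, d_model)
    V = torch.cat(v_cache, dim=0)
    attn = F.softmax(q @ K.T / d_model**0.5, dim=-1)
    return attn @ V                                # attention output for the new token

for _ in range(5):                                 # simulate 5 decoding steps
    out = decode_step(torch.randn(1, d_model))
print(out.shape, len(k_cache))                     # torch.Size([1, 64]) 5
```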
Parameter-Efficient Fine Tuning
There are a few ways to go about efficient fine-tuning of LLMs. The first is brute-force prompting, which is often all that’s needed to get a ton of performance out of a language model. For a more interesting technique, the next evolution in fine-tuning methods is soft prompting, where learned vectors are prepended to the embedded input before it is passed through the transformer blocks.
Overview of Soft Prompting
HuggingFace's guide to Soft Prompting techniques
https://huggingface.co/docs/peft/conceptual_guides/prompting
While soft prompting is interesting, it’s actually not used very often in practice. What’s more common is a technique called LoRA, where low-rank adapters are added to the weight matrices of a neural network to approximate a “full” fine-tune without training the full number of parameters (a minimal sketch follows the paper below). If you’re interested in fine-tuning a large LLM but don’t have a bunch of GPUs, LoRA is the way to go.
LoRA: Low-Rank Adaptation of Large Language Models
Low-Rank Adaptation (LoRA) paper
https://arxiv.org/abs/2106.09685
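Here’s a minimal sketch of the idea, wrapping a single linear layer; real libraries handle which layers to wrap, merging the adapters back into the weights, and saving, none of which appears here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update
    (B @ A, scaled by alpha / r) added to its output."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze the pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)        # far fewer trainable parameters than the full 512 x 512
```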
Quantization
Quantization means representing numbers that are normally stored in 32 bits using fewer bits. For example, 4-bit quantization takes a bunch of 32-bit numbers and represents each using only 4 bits. In this case, the numbers in question are the LLM’s weights. As you can imagine, this makes storing the LLM 8 times cheaper (in the case of 4-bit quantization) and helps it generate tokens faster. The downside (and it’s a big one) is that it tends to make LLMs a lot dumber. There’s been a lot of research into how to best train LLMs that make use of quantization, however, and it’s become very popular, with many new open-source models taking advantage of 16-bit precision and some venturing into the territory of 8-bit quantization. The blog post below by HuggingFace is a great primer on the pros and cons of quantization and is well worth a read (a toy sketch follows it).
A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using transformers, accelerate and bitsandbytes
Blog post by HuggingFace overviewing quantization
https://huggingface.co/blog/hf-bitsandbytes-integration
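As a toy illustration of the core idea (not how bitsandbytes actually works internally, which adds refinements such as outlier handling and block-wise scales), here’s absmax 8-bit quantization of a small weight tensor.

```python
import torch

weights = torch.randn(4, 4)

scale = weights.abs().max() / 127                    # map the largest magnitude to 127
q_weights = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)
dequantized = q_weights.float() * scale              # approximate reconstruction

print("max absolute error:", (weights - dequantized).abs().max().item())
# Storage drops from 32 bits to 8 bits per weight (plus one scale per tensor or
# per block), at the cost of the small reconstruction error printed above.
```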
In addition to the helpful introductory blog post, HuggingFace integrates with a useful library called bitsandbytes, which works well with models loaded from HuggingFace to apply quantization with minimal code. If you’re interested in experimenting with quantization, this is a good place to start.
Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA
Blog post by HuggingFace introducing bitsandbytes, their library for quantization
https://huggingface.co/blog/4bit-transformers-bitsandbytes
Finally, QLoRA is an interesting extension of LoRA from the previous section, with the Q standing for Quantized. A cool example of how it’s possible to combine ideas in LLM research to come up with innovative solutions to hard problems!
QLoRA: Efficient Finetuning of Quantized LLMs
Quantized LoRA paper
https://arxiv.org/abs/2305.14314
Hardware Optimizations
I’ll admit that I don’t have much knowledge about GPUs and CUDA, but even so, I’ve taken the time to read the FlashAttention paper, which is incredibly important for anyone even thinking about training LLMs. It’s a great read even without a hardware background, and a landmark paper in the field that’s worth knowing about at the very least.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention paper
https://arxiv.org/abs/2205.14135
Decoding Optimizations
One interesting optimization angle to consider is the bottleneck of LLMs generating only a single token at a time. What if instead they could generate multiple tokens at a time, increasing the speed of response generation while keeping the quality of an LLM that generates one token at a time? That’s the main idea behind the following two works, which introduce speculative decoding and lookahead decoding. While the latter was introduced as a successor to the former, even speculative decoding is only recently finding its way into large open-source LLMs as a viable technique for faster generation (a toy sketch of the greedy variant follows the links below).
Fast Inference from Transformers via Speculative Decoding
Speculative decoding paper
https://arxiv.org/abs/2211.17192
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
LMSYS blog post on lookahead decoding
https://lmsys.org/blog/2023-11-21-lookahead-decoding/
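To get a feel for the mechanics, here’s a toy sketch of the greedy variant of speculative decoding, with meaningless random “models” standing in for the draft and target LLMs. The real method verifies all proposals in a single batched target pass and handles sampled (not just greedy) tokens with a probabilistic accept/reject rule, both of which are simplified away here.

```python
import torch

vocab, d = 100, 32
emb = torch.randn(vocab, d)
draft_head, target_head = torch.randn(d, vocab), torch.randn(d, vocab)

def greedy_next(tokens, head):
    # Next token from a (meaningless) bag-of-embeddings "model", greedily.
    return (emb[tokens].mean(0) @ head).argmax().item()

def speculative_step(tokens, k=4):
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(greedy_next(proposal, draft_head))
    # 2. The expensive target model checks each proposed position (a loop here;
    #    one batched forward pass in a real implementation).
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        target_token = greedy_next(proposal[:i], target_head)
        accepted.append(target_token)
        if target_token != proposal[i]:    # first disagreement: keep the target's token, stop
            break
    else:
        # All k proposals accepted: the target's pass also yields one bonus token.
        accepted.append(greedy_next(proposal, target_head))
    return accepted

tokens = [1, 2, 3]
tokens = speculative_step(tokens)
print(tokens)   # 1 to k + 1 new tokens appended per target-model "pass"
```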
Part 5: Open-Source Advancements
There are a few important papers in the open-source LLM literature, many of which are tied to the state-of-the-art open-source models used today.
LLaMA: Large Language Model Meta AI
While the name isn’t particularly creative, the LLaMA family is one of the most advanced sets of open-source LLMs available today, powered by Meta’s immense resources, AI talent, and commitment to open source. We’re currently on Llama 3, but it’s worth reading the LLaMA 1 and Llama 2 papers to get a sense of how these models evolved and which techniques came in and out of popularity throughout their development.
LLaMA: Open and Efficient Foundation Language Models
LLaMA 1 Paper
https://arxiv.org/abs/2302.13971
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2 Paper
https://arxiv.org/abs/2307.09288
The Llama 3 Herd of Models
Llama 3 Paper
https://arxiv.org/abs/2407.21783
Mistral AI
Next up is the Mistral series of models, which have never quite been SOTA but are good to be aware of, as Mistral is still a relatively well-known lab and one of the few releasing open frontier models.
Mistral 7B
Mistral 7B Paper
https://arxiv.org/abs/2310.06825
Mixtral of experts
Mistral AI blog post on their Mixtral of experts model
https://mistral.ai/news/mixtral-of-experts/
Phi
The Phi series of models by Microsoft is unique in that they are focused on maximizing performance at small parameter counts, a stark contrast from many other model providers trying to max out all possible scaling laws. Due to this unique direction, there are some interesting nuggets and research directions presented in these papers to make the most of smaller models.
Textbooks Are All You Need II: phi-1.5 technical report
Phi-1.5 Paper
https://arxiv.org/abs/2309.05463
Phi-4 Technical Report
Phi-4 Paper
https://arxiv.org/abs/2412.08905
Deepseek
DeepSeek is a Chinese lab whose models are extremely strong in coding and math, and are at the very least the strongest open-source models in these areas.
DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
DeepSeek-Coder Paper
https://arxiv.org/abs/2401.14196
DeepSeek-V3 Technical Report
DeepSeek-V3 Paper
https://arxiv.org/abs/2412.19437
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-R1 Paper
https://arxiv.org/abs/2501.12948
Closed-Source Technical Reports
While this section is about open-source models, I would be remiss not to mention iconic reports such as those for GPT-4, Claude, and Gemini, especially given their positions as some of the strongest models today.
GPT-4 Technical Report
GPT-4 Paper
https://arxiv.org/abs/2303.08774
The Claude 3 Model Family: Opus, Sonnet, Haiku
Claude 3 Paper
https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
Introducing Gemini 2.0: our new AI model for the agentic era
Gemini 2.0 announcement blog post
https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/
Honorable Mentions
A few honorable mentions that are worth knowing about:
ModernBERT is a new take on BERT combining new knowledge, better data, and more compute.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
ModernBERT Paper
https://arxiv.org/abs/2412.13663
Apple Intelligence is an interesting series of models focused on being small enough to fit on-device, which brings its own set of challenges.
Apple Intelligence Foundation Language Models
Apple Intelligence Paper
https://arxiv.org/abs/2407.21075
Vicuna, an open-source chatbot released by LMSYS, was fine-tuned on user-shared ChatGPT conversations, effectively distilling a larger model’s behavior into a smaller one with extremely strong results. A great look into the power of distillation from a bigger model to a smaller model.
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality
Vicuna Paper by LMSYS
https://lmsys.org/blog/2023-03-30-vicuna/
Part 6: Agents
There isn’t much novel content to cover here (in this post at least), since agents are more a use case of LLMs than a novel development in themselves.
I would recommend starting with the Toolformer paper, which helped kick off the notion of LLMs using tools and functions to accomplish their goals. This idea eventually transformed into the modern-day notion of Agents, though even defining what that means is still hazy.
Toolformer: Language Models Can Teach Themselves to Use Tools
Toolformer Paper
https://arxiv.org/abs/2302.04761
Transitioning towards modern-day agent architectures, we have the ReAct paper, which, while relatively straightforward, demonstrates how interleaving reasoning traces with tool calls creates an agent that can “reason” about which “actions” to take (a minimal loop sketch follows the paper below).
ReAct: Synergizing Reasoning and Acting in Language Models
ReAct Paper
https://arxiv.org/abs/2210.03629
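To make the loop concrete, here’s a minimal, heavily simplified ReAct-style sketch. The call_llm function is a placeholder for a real LLM API, and the tool set, prompt format, and parsing are my own assumptions rather than the paper’s exact setup.

```python
import re

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy tool, unsafe for real input
    "search": lambda query: f"(pretend search results for: {query})",
}

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a chat/completions API here
    # and return the model's next Thought/Action or Final Answer.
    return "Thought: I should compute this.\nAction: calculator[2 + 2 * 10]"

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = call_llm(transcript)
        transcript += reply + "\n"
        if reply.strip().startswith("Final Answer:"):
            return reply.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", reply)
        if match:                       # run the requested tool and feed back an observation
            tool, arg = match.groups()
            observation = TOOLS[tool](arg) if tool in TOOLS else f"Unknown tool: {tool}"
            transcript += f"Observation: {observation}\n"
    return transcript                   # give up after max_steps and return the trace

print(react_loop("What is 2 + 2 * 10?"))
```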
I’ll conclude the papers in this section with Stanford’s landmark Generative Agents work, where researchers created an online town whose residents were each simulated by an individual LLM. Each “resident” had its own memory, time to reflect, and an awareness of its surroundings. Putting these together, the LLM agents/residents interacted with the world and with each other, leading cohesive lives and developing storylines of their own. While the non-highlighted transcripts are a bit bland to read, this was truly a step-function work in demonstrating what LLMs powered by history, memory, tool usage, and the ability to interact with other LLMs can do.
Generative Agents: Interactive Simulacra of Human Behavior
Generative Agents Paper
https://arxiv.org/abs/2304.03442
While not a paper, I can’t get away with not mentioning OpenAI’s o3 model. While not an agent in itself (as far as we know), the idea that scaling test-time compute will lead to reliable agents is taking hold in the literature and in the industry.
OpenAI o3 Breakthrough High Score on ARC-AGI-Pub
ARC Prize blog post on OpenAI's o3 results
https://arcprize.org/blog/oai-o3-pub-breakthrough
Lastly, if you would like an overview of autonomous agents, Chip Huyen once again provides a great overview in her blog.
Agents
Chip Huyen's blog post on agents
https://huyenchip.com/2025/01/07/agents.html
Extra Resources
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- Voyager: An Open-Ended Embodied Agent with Large Language Models
- MemGPT: Towards LLMs as Operating Systems
Conclusion
Whew, this was a long post! The incredible thing, however, is that we’re just scratching the surface of all the cool LLM research happening right now. The number of AI-related papers published keeps climbing rapidly, and with the surge of interest in LLMs over the last few years, it’s only going to keep increasing. There are so many unsolved problems, exciting opportunities, and new research directions to discover - good luck!
If this post helped you in any way or if you’ve caught up to speed and are interested in participating in new LLM research, I’d love to hear from you. Feel free to contact me anytime at sgoel9@berkeley.edu!