Getting Caught Up to Modern LLM Research
2025-01-15
Table of Contents
- Introduction
- Part 0: RNNs, Encoder-Decoder, and Attention
- Part 1: The Transformer
- Part 2: Past the Original Transformer
- Part 3: Training Methods and Pipelines
- Part 4: Inference and Training Optimization
- Part 5: Open-Source Advancements
- Part 6: Agents
- Conclusion
Introduction
Catching up to LLM research is hard. It’s too easy to take shortcuts from half-baked Twitter posts and find yourself “knowing” about LLMs, but not necessarily understanding them. There’s a time and place for both, but if you’re looking to get involved in AI research, specifically LLM research, it’s incredibly important to do things the right way, also known as the hard way.
I decided to start doing AI research around a year ago, so I’ve personally struggled with finding the right resources and the right path to best learn the topics needed to get a grasp of the LLM research landscape. There are countless topics, subtopics, niches, and ideas to explore, and I know that a curated guide introducing me to the field would have been extremely valuable. Well, here it is: the blog post I wish I’d had when starting my own journey.
Preamble
There are two parts to this collection, catered towards two audiences.
Audience one has a solid grasp of PyTorch and an at least surface-level understanding of deep neural networks up until CNNs and RNNs.
Audience two is more focused on learning LLM-specific topics and is willing to skip the nitty-gritty implementation details in the process.
If you have the background for it, meaning you have some programming experience and have taken some sort of intro to ML/AI course in the past, I highly recommend you go through the entirety of the resources listed in this post, as honing your PyTorch and programming skills alongside a theoretical understanding of AI will have the highest payoff. If not or if you’re looking to move faster, no worries. Feel free to skip ahead to Part 1.
Disclaimer: The ordering and presentation of resources on this page are extremely biased to what I found worked for me. If you prefer learning another way, be aware of potential differences.
Technical Prerequisites
If you are already proficient in PyTorch, don’t want to program, or want to skip forward to learning about LLMs, feel free to move ahead to Part 1.
If you don’t have any experience with Neural Networks, 3Blue1Brown’s Neural Networks playlist is a good place to start gaining some intuition around basic deep learning.
If you have some programming experience but have no experience with Neural Networks, I recommend going through the first 4-5 videos of Andrej Karpathy’s Neural Networks: Zero to Hero course.
This is more optional, but if you’re not familiar with CNNs and RNNs and want to learn more, consider taking a look at the first half of UC Berkeley’s CS 182: Deep Neural Networks or the first third of Stanford’s CS 224N: Natural Language Processing with Deep Learning.
If you have no familiarity with PyTorch but have some programming experience, I highly recommend going through this course on PyTorch. LLM research is heavily biased towards engineering skills and applied science, so having confidence in your PyTorch skills has an extremely high payoff.
But what is a neural network?
3Blue1Brown's Neural Networks playlist
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
Neural Networks: Zero to Hero
Andrej Karpathy's Neural Networks: Zero to Hero course
https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
Intro to PyTorch
Make sure you can replicate the lessons from scratch before continuing!
https://github.com/mrdbourke/pytorch-deep-learning
Part 0: RNNs, Encoder-Decoder, and Attention
If you’re already familiar with RNNs, attention mechanisms, and encoder-decoder architectures, or if you want to dive straight into LLMs, feel free to skip this section. However, if you’re new to these concepts, the following resources will provide a solid foundation for understanding future work in transformers and LLMs.
Seq2Seq Implementation with RNNs
To grasp the basics of RNNs and how they’re implemented in PyTorch, I recommend working through the following two tutorials.
These tutorials will guide you through implementing both a classifier and a generative model using an RNN backbone. I highly recommend being able to recreate the code for each tutorial from scratch, including the training loop and data loading, without referring to the tutorial.
Classification with RNNs
NLP From Scratch: Classifying Names with a Character-Level RNN — PyTorch Tutorials
https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
Generation with RNNs
NLP From Scratch: Generating Names with a Character-Level RNN — PyTorch Tutorials
https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html
Encoder-Decoder Architecture
Once you have a solid grasp of RNNs, the next step is to understand the shift towards encoder-decoder architectures. These architectures first emerged in the context of RNNs before being further popularized by transformers. Look over the following papers to understand the basics of the encoder-decoder architecture.
Encoder-Decoder Architecture using RNNs
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
https://arxiv.org/abs/1406.1078
Encoder-Decoder Models for Seq2Seq Tasks
Sequence to Sequence Learning with Neural Networks
https://arxiv.org/abs/1409.3215
The Attention Mechanism
The next building block towards modern LLMs is a solid understanding of the attention mechanism, one of the key ingredients powering the performance we see in modern LLMs. Attention was first popularized within the NLP community for use with RNNs before being adapted for transformers. The initial transformer paper settled on a version of scaled dot-product attention, but it’s worth exploring the evolution of attention mechanisms to gain a deeper understanding. Interestingly enough, a few notable individuals in the deep learning community called out attention as a step-function improvement in NLP capabilities, though at the time it was too early to tell just how important it would become.
One of the first well-known versions of the attention function was called Bahdanau attention, or additive attention. Although it wasn’t the final attention function used in LLMs, the original paper provides valuable insights into the intuition behind attention. The paper is well written, so I highly suggest implementing the model in the paper from scratch to get a feel for the attention mechanism outside of the context of transformers.
Additive Attention in RNNs
Neural Machine Translation by Jointly Learning to Align and Translate
https://arxiv.org/abs/1409.0473
This blog post may also be helpful as an illustration of the attention mechanism:
The Illustrated Attention Mechanism
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
After understanding attention, go through part 3 of the PyTorch RNN tutorials to see a full implementation of an encoder-decoder attention-based RNN architecture and train it end-to-end.
Attention-based Seq2Seq Translation
NLP From Scratch: Translation with a Sequence to Sequence Network and Attention — PyTorch Tutorials
https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
Extra Resources
This blog post is mainly about LLM research, so I’ll stop here with pre-transformer topics. If you’re interested, however, here are a few resources I’d recommend as extra reading.
- https://karpathy.github.io/2015/05/21/rnn-effectiveness/
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Part 1: The Transformer
One of the most cited academic papers in existence, and still the best starting point for learning about transformers, is the OG Attention is All You Need paper.
Attention Is All You Need
The paper that kicked off the LLM and Transformer model revolution
https://arxiv.org/abs/1706.03762
This paper will likely be difficult to fully understand unless you attempt to implement the transformer yourself in PyTorch, from scratch. There are many tutorials for this, but start by trying to recreate elements of its architecture based solely on the paper. Try not to use references other than the earlier RNN tutorials, so you can appreciate the shift from RNNs to transformers and the implementation differences. Looking at the shapes of all the input/output tensors can be extremely helpful for understanding what’s going on internally.
Again, Jay Alammar provides a well-known blog post with illustrations to demonstrate the internals of transformers:
The Illustrated Transformer
Another great illustrative blog post by Jay Alammar, this time on the Transformer architecture
https://jalammar.github.io/illustrated-transformer/
After implementing the transformer yourself, you can try recreating it using an implementation from Harvard’s NLP group. I don’t agree with a lot of their software design choices, but it never hurts to see another implementation. You can also use the code as a reference in case you get stuck on implementing your version of the transformer.
The Annotated Transformer
A rehash of 'Attention is all You Need', but with annotations for PyTorch implementations of various blocks introduced in the paper
https://nlp.seas.harvard.edu/annotated-transformer/
Take your time on this part. Deeply understanding each piece of the original transformer will lay the groundwork for a better understanding of modern LLM research, and it’s well worth the time. Today’s LLMs look strikingly similar to the architecture in the original paper, despite the 8+ years that have passed since its publication.
Extra Resources
Implementation Details
While it’s important to be able to code a transformer, its data loading, and its training loop from scratch, you’ll rarely use a naive self-written implementation in practice. PyTorch offers modules for each component of a transformer, all the way up to an entire Transformer module, and these come with many useful optimizations under the hood. To internalize them, I recommend making a copy of your original transformer implementation and gradually substituting in each piece from the PyTorch implementations, measuring the speedups as you go: start with the Encoder/Decoder layers, then the Encoder/Decoder stacks, and finally the full Transformer module (a minimal sketch of this progression follows the links below).
TransformerEncoderLayer PyTorch Module
TransformerEncoderLayer — PyTorch 2.1 documentation
https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html
TransformerDecoderLayer PyTorch Module
TransformerDecoderLayer — PyTorch 2.1 documentation
https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoderLayer.html
TransformerEncoder PyTorch Module
TransformerEncoder — PyTorch 2.1 documentation
https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html
TransformerDecoder PyTorch Module
TransformerDecoder — PyTorch 2.1 documentation
https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoder.html
Transformer PyTorch Module
Transformer — PyTorch 2.1 documentation
https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html
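To make the substitution concrete, here’s a minimal sketch of how the built-in modules compose, from individual layers up to the full nn.Transformer. The hyperparameters and random tensors are purely illustrative.

```python
import torch
import torch.nn as nn

# Composing PyTorch's built-in transformer modules on illustrative random inputs.
d_model, nhead, num_layers = 512, 8, 6

enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
decoder = nn.TransformerDecoder(dec_layer, num_layers=num_layers)

src = torch.randn(2, 10, d_model)  # (batch, src_len, d_model), already embedded
tgt = torch.randn(2, 7, d_model)   # (batch, tgt_len, d_model), already embedded

memory = encoder(src)                                         # (2, 10, d_model)
causal_mask = nn.Transformer.generate_square_subsequent_mask(7)
out = decoder(tgt, memory, tgt_mask=causal_mask)              # (2, 7, d_model)

# Or swap in the full encoder-decoder stack as a single module:
model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=num_layers, num_decoder_layers=num_layers,
                       batch_first=True)
out2 = model(src, tgt, tgt_mask=causal_mask)                  # (2, 7, d_model)
```

Note that these modules expect already-embedded inputs; the token embedding, positional encoding, and final projection back to the vocabulary are still up to you.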
Part 2: Past the Original Transformer
While the original transformer was a groundbreaking work, it took a while before it gained widespread adoption in the NLP community, and even longer until it became popularized outside academia (ChatGPT came out 5 years afterward!). Along the way, there were many attempts to increase the power of the transformer by modifying its architecture, training scheme, usage, and more! It’s extremely important to understand how we bridged the gap between the transformer released in 2017 and the deluge of capability-focused papers starting in 2020, and there are a few works commonly referenced when discussing this transitional period.
Make sure you familiarize yourself with the patterns in these works, as you’ll notice themes that come up again and again in modern research and are key to understanding the current landscape. I’ll also include key benchmark papers to help you interpret the results reported in these works. Feel free to skim the benchmark papers; a surface-level understanding of each eval dataset is enough to get a gist of where these models were getting good and what academics were focused on improving.
Early Derivatives
While the original transformer used an encoder-decoder architecture to model its downstream task of translation, it became apparent that separating these two aspects (the encoder and the decoder) opened up a world of possibilities for what a transformer layer could be used for, and greatly widened the scope of tasks that the transformer architecture was applied to.
While some later works attempted to unify these tasks into a larger encoder-decoder architecture instead of separating them, today’s model landscape treats them separately. Embedding models, for example, are typically used for semantic understanding, classification, and information retrieval. Decoder models, on the other hand, are the main generators in language modeling, producing novel text, and are sometimes just called Language Models (LMs). Read the following papers to understand the distinction, its origin, and why this split has endured in today’s well-known models.
First, we’ll introduce BERT and its successor, RoBERTa. This was the first family of popular text embedding models to use the transformer architecture, and BERT is an extremely foundational paper for modern NLP applications. RoBERTa is a great paper because the authors just juiced up BERT with more data and compute and got better performance than many other works claiming to introduce novel techniques on top of BERT. You’ll see that this is a common theme in modern AI research.
We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. These results […] raise questions about the source of recently reported improvements.
— RoBERTa paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT Paper
https://arxiv.org/abs/1810.04805
RoBERTa: A Robustly Optimized BERT Pretraining Approach
RoBERTa Paper
https://arxiv.org/abs/1907.11692
Next is the GPT, or the Generative Pre-trained Transformer. This is the first version of what eventually became ChatGPT / GPT-4, and is certainly an interesting read to learn about the origin of the models we use today.
Improving Language Understanding by Generative Pre-Training
Generative Pre-trained Transformer (GPT 1) Paper
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Advancing Encoder-Decoder Models
Even with the advances found by splitting the original transformer into encoder-only and decoder-only models, later attempts tried to improve performance by reuniting the two halves of the original model. Two well-known models came out of these efforts and are still sometimes used today as open-source baselines: BART and T5. It’s interesting to continue reading through these earlier works to see the evolution of ideas from the original transformer, to BERT and GPT, to T5 and BART, and then back to GPT and BERT variants.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
BART Paper
https://arxiv.org/abs/1910.13461
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 Paper
https://arxiv.org/abs/1910.10683
A quick add-on - Transformer-XL is an interesting transformer variant that introduced recurrence and relative positional embeddings into the transformer literature. While it was never truly SOTA, it’s an important read that introduces many ideas still relevant today.
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Transformer-XL paper
https://arxiv.org/abs/1901.02860
Decoding Strategies
While not tied to any specific transformer variant or LLM, understanding how tokens are actually generated from the transformer’s probabilistic output is a fundamental building block. This transition from a probability distribution to output token(s) is called decoding, and it can be done in a variety of ways. I recommend taking a minute to read this high-level overview of the most common decoding strategies modern models implement (a small sampling sketch follows the link below).
Decoding Strategies that You Need to Know for Response Generation
Blog post on decoding strategies
https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc
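To ground the ideas from the post above, here’s a toy sketch of greedy decoding, temperature sampling, and top-k sampling over a single vector of next-token logits. The logits are random and the vocabulary size is tiny, purely for illustration.

```python
import torch
import torch.nn.functional as F

# Pretend these are the next-token logits produced by a language model.
logits = torch.randn(10)

# Greedy decoding: always pick the highest-probability token.
greedy_token = torch.argmax(logits).item()

# Temperature sampling: divide logits by a temperature before sampling.
# T < 1 sharpens the distribution, T > 1 flattens it.
temperature = 0.7
probs = F.softmax(logits / temperature, dim=-1)
sampled_token = torch.multinomial(probs, num_samples=1).item()

# Top-k sampling: keep only the k most likely tokens, renormalize, then sample.
k = 5
topk_vals, topk_idx = torch.topk(logits, k)
topk_probs = F.softmax(topk_vals, dim=-1)
topk_token = topk_idx[torch.multinomial(topk_probs, num_samples=1)].item()

print(greedy_token, sampled_token, topk_token)
```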
Scaling up the Generative Pre-trained Transformer
Obligatory inclusion of GPT-2 and GPT-3 papers.
Honestly, I don’t think the GPT-2 and GPT-3 papers are particularly informative, but I’m including them for the sake of completeness more than anything.
Language Models are Unsupervised Multitask Learners
GPT-2 Paper
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Language Models are Few-Shot Learners
GPT-3 Paper
https://arxiv.org/abs/2005.14165
Scaling Up [in General]
A lot of work in 2021-2022 focused on scaling up parameter counts. Earlier papers had shown power-law relationships between parameter count and transformer capabilities, but the papers below demonstrate truly massive scale (surprise, they’re from Google!) and discuss these scaling laws in more detail. The first is PaLM, Google’s first foray into publicly announced LLMs (predating Bard, which predated Gemini).
PaLM: Scaling Language Modeling with Pathways
PaLM Paper
https://arxiv.org/abs/2204.02311
Another method used to scale parameter counts is the Mixture of Experts (MoE) model, which has been hypothesized to be the architecture behind the original GPT-4. While the specific model below is fairly insignificant today, the ideas introduced in the paper are a good intro to MoE models (a toy routing sketch follows it).
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers (MoE) Paper
https://arxiv.org/abs/2101.03961
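As a rough illustration of the routing idea (not the Switch Transformer implementation itself), here’s a toy top-1 MoE feed-forward layer. The sizes are arbitrary, and real implementations add load-balancing losses and expert capacity limits that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """A toy top-1 Mixture-of-Experts feed-forward layer (Switch-style routing)."""
    def __init__(self, d_model=64, d_ff=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # produces routing logits per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                              # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)       # (num_tokens, num_experts)
        weight, expert_idx = gate.max(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                     # tokens routed to expert i
            if mask.any():
                out[mask] = weight[mask].unsqueeze(1) * expert(x[mask])
        return out

tokens = torch.randn(32, 64)                           # 32 tokens, d_model = 64
print(ToyMoELayer()(tokens).shape)                     # torch.Size([32, 64])
```

The appeal is that parameter count grows with the number of experts while the compute per token stays roughly constant, since each token only visits one (or a few) experts.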
The final paper of this section is very well-known and is often referenced when debating modern scaling laws and the idea that we’ve run out of data to train new models. In the Chinchilla paper, DeepMind shows that earlier power-law results relating parameter count to performance were incomplete, and that adding dataset size as an axis helps fill the gap. Many follow-up papers from DeepMind used the Chinchilla family as base models to test new LLM techniques for 1-2 years, making these models relatively long-lived in their usage.
Training Compute-Optimal Large Language Models
Chinchilla Paper
https://arxiv.org/abs/2203.15556
Key Benchmarks:
These benchmarks are useful to take a look at since many of these are still used today.
- [Important] MMLU: Measuring Massive Multitask Language Understanding
- BIG-Bench: Beyond the Imitation Game Benchmark
- WinoGrande
- HellaSwag
- BoolQ
- GSM8K: Grade School Math 8K
- [Important] MATH: Measuring Mathematical Problem Solving
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark
The Rise of the Embedding Model
Embedding models have been instrumental in the rise of LLMs and agents, and are the unsung heroes of the AI revolution. Why? It turns out that the simple concept of representing any amount of text as a vector is extremely powerful, and it powered Google’s recommendation systems, information retrieval, classification, and more for years before LLMs came onto the scene. Transformer-based embedding models became the new norm after “Attention is All You Need”, but many of the core ideas followed their own evolutionary path, slightly disjoint from the research paths taken for generative models. How did we get here? Read the original Word2Vec paper to find out.
Efficient Estimation of Word Representations in Vector Space
Word2Vec paper
https://arxiv.org/abs/1301.3781
Now we can transform a word into a vector, but how do we go even further and transform multiple words, sentences, and even paragraphs into vectors? After all, embedding entire essays into an N by M block of numbers doesn’t seem like the most efficient way to go…
This is where sentence-BERT came in, taking the ideas we saw earlier in BERT and making them much more usable by enabling us to embed any amount of text into a single vector.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT paper
https://arxiv.org/abs/1908.10084
Sentence embeddings are cool, but now we want to learn how to use them with LLMs! The answer is Retrieval-Augmented Generation (RAG), a technique commonly used today to provide LLMs with whatever external information they need to fully respond to a query. When you use an LLM with web search, over a database, or over your files, RAG is happening in the background to give the LLM the context it needs to answer your query. Below is the original RAG paper, which, while a little verbose, is still a good read for those interested in the art of RAG (a minimal retrieval sketch follows it).
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Retrieval-Augmented Generation (RAG) introductory paper
https://arxiv.org/abs/2005.11401
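To make the retrieval half of RAG concrete, here’s a minimal sketch. The embed function is a stand-in for a real embedding model (here it’s a fake hash-based embedding just so the snippet runs end to end), and the prompt format is my own assumption rather than anything from the paper.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder embedding: deterministic random unit vector per string.
    # Swap in a real sentence embedding model for meaningful retrieval.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

documents = [
    "The Eiffel Tower is located in Paris.",
    "Transformers use self-attention to process sequences.",
    "Retrieval-augmented generation grounds LLM answers in external documents.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, top_k: int = 2) -> list[str]:
    q = embed(query)
    scores = doc_vectors @ q                  # cosine similarity (vectors are unit-norm)
    best = np.argsort(-scores)[:top_k]
    return [documents[i] for i in best]

question = "How does RAG help language models?"
context = "\n".join(retrieve(question))
prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# `prompt` would then be sent to the LLM of your choice.
print(prompt)
```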
Next, I’ll introduce the concept of re-ranking via a paper dubbed Lost in the Middle, which explores how LLMs read the contents of their context window and presents some extremely interesting and relevant findings for how we should place retrieved context within an LLM’s input.
Lost in the Middle: How Language Models Use Long Contexts
Lost in the Middle effect and re-ranking motivating paper
https://arxiv.org/abs/2307.03172
Finally, a cool paper showing how decoder-based models that were trained to generate text can actually be adapted into powerful embedding models. This is a relatively new idea but has been gaining popularity, with real performance implications for modern uses of embedding models.
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
LLM2Vec Paper
https://arxiv.org/abs/2404.05961
As a quick bonus, I’ll throw in a research paper I wrote on embedding models! My co-author and I show that today’s widely available embedding models display clear positional bias, meaning that they prioritize and overweight content at the beginning of their input compared to content at the end. This has implications for chunking, information retrieval, and more, where this inductive bias may not be welcome from a performance standpoint.
Quantifying Positional Biases in Text Embedding Models
Embedding models display clear positional biases, with implications for chunking, information retrieval, and more
https://arxiv.org/abs/2412.15241
Applying Transformers to Vision Tasks
Surprisingly, the same architecture that powers LLMs also does extremely well on tasks that take some sort of visual input (photos, video). While these models don’t always match dedicated Computer Vision (CV) models, combining the language capabilities of transformers with visual input is extremely powerful for tasks such as captioning and text-to-image generation.
A useful paper to start with here is the Vision Transformer. This paper is fairly straightforward to implement on top of a regular transformer, so I would encourage attempting to create this model along with its data pipeline after a first pass.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Vision Transformer Paper
https://arxiv.org/abs/2010.11929
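As a quick illustration of the paper’s core trick, here’s a sketch of the patch-embedding step. The dimensions follow the common ViT-Base configuration, but everything else about the model (the encoder stack, the classification head) is omitted.

```python
import torch
import torch.nn as nn

# A Conv2d with kernel size and stride equal to the patch size is equivalent to
# slicing the image into non-overlapping 16x16 patches and linearly projecting each.
patch_size, d_model = 16, 768
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)

images = torch.randn(2, 3, 224, 224)             # (batch, channels, height, width)
patches = patch_embed(images)                    # (2, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)      # (2, 196, 768): a sequence of patch tokens

# Prepend a learnable [CLS] token; positional embeddings are added before the
# sequence goes through a standard transformer encoder.
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], dim=1)   # (2, 197, 768)
print(tokens.shape)
```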
One of the most notable works in this vein is CLIP, which arguably kick-started the idea of using transformers to combine both visual and text-based inputs. CLIP is still used today as a reference and most visual LLMs (VLLMs) are built on top of ideas introduced in the CLIP paper.
Learning Transferable Visual Models From Natural Language Supervision
CLIP Paper
https://arxiv.org/abs/2103.00020
To round this section out with some modern research, we’ll mention LLaVA, which was one of the first open-source VLLMs fine-tuned for instruction following.
Visual Instruction Tuning
LLaVA Paper
https://arxiv.org/abs/2304.08485
Finally, we mention the DALL-E series of models, which make up OpenAI’s image generation offering within ChatGPT.
Zero-Shot Text-to-Image Generation
DALL-E 1 Paper
https://arxiv.org/abs/2102.12092
Hierarchical Text-Conditional Image Generation with CLIP Latents
DALL-E 2 Paper
https://arxiv.org/abs/2204.06125
Improving Image Generation with Better Captions
DALL-E 3 Paper
https://cdn.openai.com/papers/dall-e-3.pdf
Applications and Honorable Mentions
The following papers are important when applying LLMs to real-world use cases but don’t neatly fit into any of the earlier categories or any specific storyline.
The first is Codex, the first major model trained on code. As you can imagine from the proliferation of LLMs in coding, the ideas presented in this paper are extremely relevant and laid the foundation for many of the ideas around code-based training used today.
Evaluating Large Language Models Trained on Code
Codex paper
https://arxiv.org/abs/2107.03374
The second is the Chain-of-Thought paper. This paper is relatively straightforward, and just taking a look at the abstract, the figures, and the introduction is enough to get the gist. TL;DR: telling a model to explain its reasoning tends to result in better performance. Why? We’re still figuring that out.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought paper
https://arxiv.org/abs/2201.11903
Finally, there’s the paper for Whisper, OpenAI’s voice-to-text model. Whisper was a step-function improvement over anything else when it came out and continues to be the base for SOTA voice-to-text models.
Robust Speech Recognition via Large-Scale Weak Supervision
Whisper paper
https://arxiv.org/abs/2212.04356
Alternative Architectures
While transformers are still the de-facto “winning” architecture for LLMs and their applications, there are some attempts to take a new class of models dubbed state-space models and bring them into the limelight. Notably, Cartesia AI is leading these efforts, founded by the creators of some of the most popular state-space models such as S4 and Mamba. While these are interesting to learn about, it’s still unclear whether they will displace transformers as the dominant architecture. That said, these models excel at extremely long-context tasks and have a few specific use cases where they perform better than the bigger transformer models of today.
Efficiently Modeling Long Sequences with Structured State Spaces
S4 state space model paper
https://arxiv.org/abs/2111.00396
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba state space model paper
https://arxiv.org/abs/2312.00752
Extra Resources
Learning about these powerful models is important, but it’s also helpful to know how to go about using these in your code for real-world tasks. Many of these models have publicly available code and weights, most of which are hosted on HuggingFace. HuggingFace is the go-to repository for accessing and interacting with ML/AI datasets and models, and it’s well worth spending some time to create an account and to try poking around to familiarize yourself with the UI.
Follow me on HuggingFace: https://huggingface.co/sgoel9
The easiest way to use a modern LLM today is HuggingFace’s transformers library. They have a great course covering the basics of the library. It’s not too long, so I would recommend going through it to learn how to load and generate output from any of the open-source models we’ve covered (a minimal example follows the link below).
Introduction - Hugging Face NLP Course
HuggingFace's intro course on NLP
https://huggingface.co/learn/nlp-course/chapter1/1
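As a minimal example of what the course covers, here’s a sketch of loading an open model and generating text with transformers. The model name is just a small example to swap for whichever open model you want to try (larger models need a GPU and more memory).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example model; substitute any open causal LM from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The transformer architecture is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```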
Part 3: Training Methods and Pipelines
There are three main steps in the training pipeline for modern LLMs such as GPT-4 and Claude:
- Pretraining: Unsupervised learning on next-word prediction. Throwing internet-scale datasets at LLMs to get them started and give them a solid understanding of natural language.
- Supervised Fine-Tuning: Similar to pretraining, we train the LLM on next-word prediction, but on texts closer to the output we want to see from the LLM. This typically involves curated, higher-quality data such as news articles, guides, and so on, of much better quality than what you see in pretraining (both stages use the same next-token objective; see the loss sketch after this list).
- Alignment: This is a pretty broad term. Typically, RL is used to align the LLM’s output with human values, with following instructions rather than just predicting the next word, or with whatever else you want to “align” the AI to. Recent research tries to accomplish this without RL.
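To make the first two stages concrete, here’s a toy sketch of the next-token cross-entropy loss they both optimize. The difference between pretraining and SFT lies mainly in the data (and SFT often masks out the prompt tokens, which is skipped here); the logits below are random placeholders for a real model’s output.

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 4, 128, 32000
logits = torch.randn(batch, seq_len, vocab)            # stand-in for model(input_ids)
input_ids = torch.randint(0, vocab, (batch, seq_len))  # stand-in for tokenized text

# Shift so that position t is scored against the token at position t + 1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab)
shift_labels = input_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.item())
```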
You can find an amazing overview of this pipeline in Chip Huyen’s blog, which I highly recommend.
RLHF: Reinforcement Learning from Human Feedback
Chip Huyen's blog post on RLHF
https://huyenchip.com/2023/05/02/rlhf.html
Pretraining
There aren’t too many works related to pretraining, since most of the work at this stage comes down to large-scale web scraping and data quality. A nice overview of the type of thinking required here can be found in the FineWeb paper by HuggingFace.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
The FineWeb Dataset paper
https://arxiv.org/abs/2406.17557
Supervised Fine-Tuning
A great example of the use cases for Supervised Fine-Tuning (SFT) can be seen in the InstructGPT paper. Here, researchers fine-tuned GPT-3 on an instruction-following dataset, aligning the model’s responses with the expected output from a helpful assistant, rather than a next token generator, which is what an LLM would be right after the pretraining stage.
Training language models to follow instructions with human feedback
InstructGPT Paper
https://arxiv.org/abs/2203.02155
Alignment
Lastly, we’ll look at a few papers and works around alignment, which is one of the most varied parts of the LLM training pipeline. There are a few popular methods to induce alignment, but the most well-known, and what led to ChatGPT’s breakthrough performance, is Reinforcement Learning from Human Feedback (RLHF). Read Chip Huyen’s article on it if you haven’t yet, to understand the process of training a model using RLHF.
RLHF: Reinforcement Learning from Human Feedback
Chip Huyen's blog post on RLHF
https://huyenchip.com/2023/05/02/rlhf.html
I consider this next paper somewhat optional, but a lab out of Stanford used some math to show that one can mimic RLHF without an explicit reward model, making training faster and more stable. It’s still debated whether Direct Preference Optimization (DPO) actually leads to better results than RLHF, but it has become a somewhat canonical part of the literature (a sketch of the loss follows the paper below).
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Direct Preference Optimization paper
https://arxiv.org/abs/2305.18290
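For intuition, here’s a sketch of the DPO objective as I read it from the paper. The log-probabilities are toy numbers; a real implementation computes them from the policy being trained and a frozen reference model (usually the SFT checkpoint).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Each argument is the summed log-probability a model assigns to a
    (prompt, response) pair for a batch of preference pairs."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to prefer the chosen response more than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy log-probabilities for a batch of 3 preference pairs, just for illustration.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.0]), torch.tensor([-14.0, -11.0, -19.0]),
                torch.tensor([-13.0, -10.0, -21.0]), torch.tensor([-13.5, -10.5, -20.0]))
print(loss.item())
```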
Another interesting development is the rise of Reinforcement Learning from AI Feedback (RLAIF), which, as it sounds, is just RLHF with the human replaced by a sufficiently powerful AI model such as GPT-4. The idea rose in popularity after the following paper came out, demonstrating that using GPT-4 as a judge to mimic a human in LMSYS’s Chatbot Arena produced judgments that closely matched human preferences. The broader idea of using an LLM as a decision-maker in place of a human has come to be known as LLM-as-a-Judge.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
LLM-as-a-Judge and Chatbot Arena paper
https://arxiv.org/abs/2306.05685
Finally, we’ll touch on Anthropic’s Constitutional AI paper, where their approach to alignment can be described as something along the lines of a massive system prompt telling their model to not be evil. An interesting read for sure.
Constitutional AI: Harmlessness from AI Feedback
Anthropic paper on Constitutional AI
https://arxiv.org/abs/2212.08073
Positional Encoding
Finally, it’s important to have some understanding of positional encoding methods, as these are ever-present throughout the model training pipeline and have evolved alongside LLMs themselves.
Positional encoding is the science behind how an LLM derives an ordering from its input. The attention operation at the core of a transformer has no inherent notion of token order, so we must inject position information into the input so that the model has some way to know that, in the phrase “the dog is brown”, the word “dog” comes before the word “brown”.
The original Attention is All You Need paper used sinusoidal positional encoding, but this was quickly shown to be suboptimal compared to alternatives. For a nice overview of positional encoding, read the relevant section of the following post by Lilian Weng (a sinusoidal-encoding sketch follows the link below). Feel free to read the entire post too; it’s extremely well written and informative, though quite dense.
The Transformer Family Version 2.0
Lilian Weng's post on Transformers, but with a helpful section on positional encoding
https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/#positional-encoding
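For reference, here’s a sketch of the original sinusoidal encoding from Attention is All You Need; the implementation style mirrors common PyTorch examples rather than any official code.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Added to the token embeddings before the first transformer block.
embeddings = torch.randn(1, 50, 512)                  # (batch, seq_len, d_model)
embeddings = embeddings + sinusoidal_positional_encoding(50, 512)
```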
While the following two papers were mentioned in the post above, they are still widely used when training LLMs today, so it’s worth reading the papers themselves to get a stronger understanding of the intuition behind how today’s models are trained.
RoFormer: Enhanced Transformer with Rotary Position Embedding
Rotary Positional Encoding paper
https://arxiv.org/abs/2104.09864
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
ALiBi Positional Encoding paper
https://arxiv.org/abs/2108.12409
Part 4: Inference and Training Optimization
While the transformer was a big improvement over its predecessor, the RNN, due in large part to its efficiency and parallelizable nature, the quadratic scaling of attention with respect to input length is still a bottleneck that many works have tried to improve upon. LLM inference-time and training-time optimization has taken a ton of directions, with the most notable being non-global attention, multi-head attention optimization, quantization, efficient fine-tuning, distillation, and more.
Non-Global Attention
While the original transformer used a global attention mechanism in which every token attends to every other token, there has been a decent amount of work on relaxing this requirement while retaining model performance. The following three papers represent the early works in this vein and are worth reading to solidify your understanding of attention, multi-head attention, and context utilization. It’s unknown exactly what attention patterns today’s closed models use, but many of the best open-source models utilize variants of the ideas presented here (a simple sliding-window mask sketch follows the papers below).
Generating Long Sequences with Sparse Transformers
Sparse Transformer Paper
https://arxiv.org/abs/1904.10509
Longformer: The Long-Document Transformer
Longformer Paper
https://arxiv.org/abs/2004.05150
Big Bird: Transformers for Longer Sequences
Big Bird Paper
https://arxiv.org/abs/2007.14062
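As a tiny illustration of the simplest non-global pattern, here’s a sketch of a causal sliding-window attention mask in the spirit of Longformer’s local attention; the window size and shapes are arbitrary, and the papers above add further patterns (dilated windows, global tokens, random blocks) on top of this.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Token i may only attend to tokens j with i - window < j <= i.
    Returns a boolean mask where True marks allowed positions."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=8, window=3)
print(mask.int())

# Scores outside the window are set to -inf before the softmax, so each token
# only attends over `window` positions instead of the full sequence.
scores = torch.randn(8, 8).masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```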
Multi-Head Attention Optimization
A small detour, and once again a technique used by many of the best open-source models today, is multi-query attention and its successor, grouped-query attention. While non-global attention techniques focus on shrinking the model’s attention map, these works reduce the cost of multi-head attention within a transformer block by sharing key and value heads across groups of query heads, which shrinks both the parameter count and, more importantly, the KV-cache (a small sketch follows the papers below).
Fast Transformer Decoding: One Write-Head is All You Need
Multi-Query Attention paper
https://arxiv.org/abs/1911.02150
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Grouped Query Attention paper
https://arxiv.org/abs/2305.13245
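Here’s a rough sketch of the grouped-query idea: store only a few key/value heads and share each across a group of query heads. The shapes are arbitrary, and real implementations fold this into the attention projections rather than materializing repeated tensors.

```python
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 64
num_q_heads, num_kv_heads = 8, 2       # 1 KV head = multi-query; 8 = standard multi-head
group_size = num_q_heads // num_kv_heads

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)   # only 2 K heads stored/cached
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)   # only 2 V heads stored/cached

# Repeat each K/V head so that every group of 4 query heads shares one K/V head.
k = k.repeat_interleave(group_size, dim=1)                 # (batch, 8, seq_len, head_dim)
v = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v)               # (batch, 8, seq_len, head_dim)
print(out.shape)
```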
Distillation
Distillation was born out of a pretty simple idea: what if you trained a smaller model on the outputs of a bigger model? If the bigger model’s performance is better, maybe you could replicate, or at least get close to, its performance with a smaller, cheaper model by having it “mimic” the larger model.
It turns out this works pretty well in practice. Distillation was popularized in the transformer world by a paper from HuggingFace, where they distilled BERT and achieved roughly similar performance with a much smaller model (a sketch of the loss follows the paper below).
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
HuggingFace paper on DistilBERT
https://arxiv.org/abs/1910.01108
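For intuition, here’s a sketch of the classic soft-label distillation loss, in the spirit of DistilBERT’s KL term (their full recipe adds more components). The logits and labels are random placeholders for a real teacher/student pair.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """KL term pulling the student's distribution toward the teacher's softened
    distribution, mixed with the usual cross-entropy on the true labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 4 examples, 10 classes (or vocabulary entries), just for illustration.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10), torch.randint(0, 10, (4,)))
print(loss.item())
```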
Distillation works for both embedding models and LLMs, with distillation from bigger-to-smaller LLMs having dramatic implications for cost-performance ratios. The next paper will discuss this process in LLMs and is a bit more up-to-date on modern distillation than the original HuggingFace paper.
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
High-level paper on LLM distillation
https://arxiv.org/abs/2305.02301
The KV-Cache
While not as academic a topic, the KV-cache is extremely important to making modern transformers work well. There unfortunately isn’t a canonical academic paper introducing it, but there are many blog posts to help you get the gist of this powerful inference-time optimization (a toy sketch follows the link below).
Transformers KV Caching Explained
KV-caching is a powerful inference-time optimization technique for transformers
https://medium.com/@joaolages/kv-caching-explained-276520203249
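Here’s a toy single-head sketch of the idea, with random projection matrices standing in for a trained model: each decoding step computes keys and values only for the newest token and appends them to the cache, instead of recomputing them for the whole prefix.

```python
import torch
import torch.nn.functional as F

d_model = 64
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(new_token_hidden):                 # (1, d_model): only the newest token
    q = new_token_hidden @ w_q
    k_cache.append(new_token_hidden @ w_k)         # cache grows by one entry per step
    v_cache.append(new_token_hidden @ w_v)
    K = torch.cat(k_cache, dim=0)                  # (steps_so_far, d_model)
    V = torch.cat(v_cache, dim=0)
    attn = F.softmax(q @ K.T / d_model**0.5, dim=-1)
    return attn @ V                                # attention output for the new token

for _ in range(5):                                 # simulate 5 decoding steps
    out = decode_step(torch.randn(1, d_model))
print(out.shape, len(k_cache))                     # torch.Size([1, 64]) 5
```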
Parameter-Efficient Fine Tuning
There are a few ways to go about efficient fine-tuning of LLMs. The first is brute-force prompting, which is often all that’s needed to get a ton of performance out of a language model. For a more interesting technique, the next evolution in fine-tuning methods is soft prompting, where learned vectors are prepended to the embedded input before it is passed through the transformer blocks.
Overview of Soft Prompting
HuggingFace's guide to Soft Prompting techniques
https://huggingface.co/docs/peft/conceptual_guides/prompting
While soft prompting is interesting, it’s actually not used very often in practice. What’s more common is a technique called LoRA, where low-rank adapters are added to the weight matrices of a neural network to approximate a “full” fine-tune without training the full number of parameters (a minimal sketch follows the paper below). If you’re interested in fine-tuning a large LLM but don’t have a bunch of GPUs, LoRA is the way to go.
LoRA: Low-Rank Adaptation of Large Language Models
Low-Rank Adaptation (LoRA) paper
https://arxiv.org/abs/2106.09685
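Here’s a minimal sketch of the idea, wrapping a single linear layer; real libraries handle which layers to wrap, merging the adapters back into the weights, and saving, none of which appears here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update
    (B @ A, scaled by alpha / r) added to its output."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze the pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)        # far fewer trainable parameters than the full 512 x 512
```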
Quantization
Quantization means representing numbers that are normally stored in 32 bits using fewer bits. For example, 4-bit quantization takes a bunch of 32-bit numbers and represents each using only 4 bits. In this case, the numbers in question are the LLM’s weights. As you can imagine, this makes storing the LLM 8 times cheaper (in the case of 4-bit quantization) and helps it generate tokens faster. The downside (and it’s a big one) is that it tends to make LLMs a lot dumber. There’s been a lot of research into how to best train LLMs that make use of quantization, however, and it’s become very popular, with many new open-source models taking advantage of 16-bit precision and some venturing into the territory of 8-bit quantization. The blog post below by HuggingFace is a great primer on the pros and cons of quantization and is well worth a read (a toy sketch follows it).
A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using transformers, accelerate and bitsandbytes
Blog post by HuggingFace overviewing quantization
https://huggingface.co/blog/hf-bitsandbytes-integration
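As a toy illustration of the core idea (not how bitsandbytes actually works internally, which adds refinements such as outlier handling and block-wise scales), here’s absmax 8-bit quantization of a small weight tensor.

```python
import torch

weights = torch.randn(4, 4)

scale = weights.abs().max() / 127                    # map the largest magnitude to 127
q_weights = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)
dequantized = q_weights.float() * scale              # approximate reconstruction

print("max absolute error:", (weights - dequantized).abs().max().item())
# Storage drops from 32 bits to 8 bits per weight (plus one scale per tensor or
# per block), at the cost of the small reconstruction error printed above.
```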
In addition to the helpful introductory blog post, HuggingFace integrates with a useful library called bitsandbytes, which works well with models loaded from HuggingFace to apply quantization with minimal code. If you’re interested in experimenting with quantization, this is a good place to start.
Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA
Blog post by HuggingFace introducing bitsandbytes, their library for quantization
https://huggingface.co/blog/4bit-transformers-bitsandbytes
Finally, QLoRA is an interesting extension of LoRA from the previous section, with the Q standing for Quantized. A cool example of how it’s possible to combine ideas in LLM research to come up with innovative solutions to hard problems!
QLoRA: Efficient Finetuning of Quantized LLMs
Quantized LoRA paper
https://arxiv.org/abs/2305.14314
Hardware Optimizations
I’ll admit that I don’t have much knowledge about GPUs and CUDA, but even so, I’ve taken the time to read the FlashAttention paper, which is incredibly important for anyone even thinking about training LLMs. It’s a great read even without a hardware background, and a landmark paper in the field that’s worth knowing about at the very least.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention paper
https://arxiv.org/abs/2205.14135
Decoding Optimizations
One interesting optimization angle to consider is the bottleneck of LLMs generating only a single token at a time. What if instead they could generate multiple tokens at a time, increasing the speed of response generation while keeping the quality of an LLM that generates one token at a time? That’s the main idea behind the following two works, which introduce speculative decoding and lookahead decoding. While the latter was introduced as a successor to the former, even speculative decoding is only recently finding its way into large open-source LLMs as a viable technique for faster generation (a toy sketch of the greedy variant follows the links below).
Fast Inference from Transformers via Speculative Decoding
Speculative decoding paper
https://arxiv.org/abs/2211.17192
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
LMSYS blog post on lookahead decoding
https://lmsys.org/blog/2023-11-21-lookahead-decoding/
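To get a feel for the mechanics, here’s a toy sketch of the greedy variant of speculative decoding, with meaningless random “models” standing in for the draft and target LLMs. The real method verifies all proposals in a single batched target pass and handles sampled (not just greedy) tokens with a probabilistic accept/reject rule, both of which are simplified away here.

```python
import torch

vocab, d = 100, 32
emb = torch.randn(vocab, d)
draft_head, target_head = torch.randn(d, vocab), torch.randn(d, vocab)

def greedy_next(tokens, head):
    # Next token from a (meaningless) bag-of-embeddings "model", greedily.
    return (emb[tokens].mean(0) @ head).argmax().item()

def speculative_step(tokens, k=4):
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(greedy_next(proposal, draft_head))
    # 2. The expensive target model checks each proposed position (a loop here;
    #    one batched forward pass in a real implementation).
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        target_token = greedy_next(proposal[:i], target_head)
        accepted.append(target_token)
        if target_token != proposal[i]:    # first disagreement: keep the target's token, stop
            break
    else:
        # All k proposals accepted: the target's pass also yields one bonus token.
        accepted.append(greedy_next(proposal, target_head))
    return accepted

tokens = [1, 2, 3]
tokens = speculative_step(tokens)
print(tokens)   # 1 to k + 1 new tokens appended per target-model "pass"
```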
Part 5: Open-Source Advancements
There are a few important papers in the open-source LLM literature, many of which are tied to the state-of-the-art open-source models used today.
LLaMA: Large Language Model Meta AI
While the name isn’t particularly creative, the LLaMA family is one of the most advanced sets of open-source LLMs available today, powered by Meta’s immense resources, AI talent, and commitment to open source. We’re currently on Llama 3, but it’s worth reading the LLaMA 1 and Llama 2 papers to get a sense of how these models evolved and which techniques came in and out of popularity throughout their development.
LLaMA: Open and Efficient Foundation Language Models
LLaMA 1 Paper
https://arxiv.org/abs/2302.13971
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2 Paper
https://arxiv.org/abs/2307.09288
The Llama 3 Herd of Models
Llama 3 Paper
https://arxiv.org/abs/2407.21783
Mistral AI
Next up is the Mistral series of models, which have never quite been SOTA but are good to be aware of, as Mistral is still a relatively well-known lab and one of the few releasing open frontier models.
Mistral 7B
Mistral 7B Paper
https://arxiv.org/abs/2310.06825
Mixtral of experts
Mistral AI blog post on their Mixtral of experts model
https://mistral.ai/news/mixtral-of-experts/
Phi
The Phi series of models by Microsoft is unique in that they are focused on maximizing performance at small parameter counts, a stark contrast from many other model providers trying to max out all possible scaling laws. Due to this unique direction, there are some interesting nuggets and research directions presented in these papers to make the most of smaller models.
Textbooks Are All You Need II: phi-1.5 technical report
Phi-1.5 Paper
https://arxiv.org/abs/2309.05463
Phi-4 Technical Report
Phi-4 Paper
https://arxiv.org/abs/2412.08905
Deepseek
DeepSeek is a Chinese lab whose models are extremely strong in coding and math, and are at the very least the strongest open-source models in these areas.
DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
DeepSeek-Coder Paper
https://arxiv.org/abs/2401.14196
DeepSeek-V3 Technical Report
DeepSeek-V3 Paper
https://arxiv.org/abs/2412.19437
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-R1 Paper
https://arxiv.org/abs/2501.12948
Closed-Source Technical Reports
While this section is about open-source models, I would be remiss not to mention iconic reports such as those for GPT-4, Claude, and Gemini, especially given their positions as some of the strongest models today.
GPT-4 Technical Report
GPT-4 Paper
https://arxiv.org/abs/2303.08774
The Claude 3 Model Family: Opus, Sonnet, Haiku
Claude 3 Paper
https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
Introducing Gemini 2.0: our new AI model for the agentic era
Gemini 2.0 announcement blog post
https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/
Honorable Mentions
A few honorable mentions that are worth knowing about:
ModernBERT is a new take on BERT combining new knowledge, better data, and more compute.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
ModernBERT Paper
https://arxiv.org/abs/2412.13663
Apple Intelligence is an interesting series of models focused on being small enough to fit on-device, which brings its own set of challenges.
Apple Intelligence Foundation Language Models
Apple Intelligence Paper
https://arxiv.org/abs/2407.21075
Vicuna, an open-source chatbot released by LMSYS, was fine-tuned on user-shared ChatGPT conversations, effectively distilling a larger model’s behavior into a smaller one with extremely strong results. A great look into the power of distillation from a bigger model to a smaller model.
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality
Vicuna Paper by LMSYS
https://lmsys.org/blog/2023-03-30-vicuna/
Part 6: Agents
There isn’t much novel content to cover here (in this post at least), since agents are more a use case of LLMs than a novel development in themselves.
I would recommend starting with the Toolformer paper, which helped kick off the notion of LLMs using tools and functions to accomplish their goals. This idea eventually transformed into the modern-day notion of Agents, though even defining what that means is still hazy.
Toolformer: Language Models Can Teach Themselves to Use Tools
Toolformer Paper
https://arxiv.org/abs/2302.04761
Transitioning towards modern-day agent architectures, we have the ReAct paper, which, while relatively straightforward, demonstrates how interleaving reasoning traces with tool calls creates an agent that can “reason” about which “actions” to take (a minimal loop sketch follows the paper below).
ReAct: Synergizing Reasoning and Acting in Language Models
ReAct Paper
https://arxiv.org/abs/2210.03629
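To make the loop concrete, here’s a minimal, heavily simplified ReAct-style sketch. The call_llm function is a placeholder for a real LLM API, and the tool set, prompt format, and parsing are my own assumptions rather than the paper’s exact setup.

```python
import re

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy tool, unsafe for real input
    "search": lambda query: f"(pretend search results for: {query})",
}

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a chat/completions API here
    # and return the model's next Thought/Action or Final Answer.
    return "Thought: I should compute this.\nAction: calculator[2 + 2 * 10]"

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = call_llm(transcript)
        transcript += reply + "\n"
        if reply.strip().startswith("Final Answer:"):
            return reply.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", reply)
        if match:                       # run the requested tool and feed back an observation
            tool, arg = match.groups()
            observation = TOOLS[tool](arg) if tool in TOOLS else f"Unknown tool: {tool}"
            transcript += f"Observation: {observation}\n"
    return transcript                   # give up after max_steps and return the trace

print(react_loop("What is 2 + 2 * 10?"))
```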
I’ll conclude the papers in this section with Stanford’s landmark Generative Agents work, where researchers created an online town whose residents were each simulated by an individual LLM. Each “resident” had its own memory, time to reflect, and an awareness of its surroundings. Putting these together, the LLM agents/residents interacted with the world and with each other, leading cohesive lives and developing storylines of their own. While the non-highlighted transcripts are a bit bland to read, this was truly a step-function work in demonstrating what LLMs powered by history, memory, tool usage, and the ability to interact with other LLMs can do.
Generative Agents: Interactive Simulacra of Human Behavior
Generative Agents Paper
https://arxiv.org/abs/2304.03442
While not a paper, I can’t get away with not mentioning OpenAI’s o3 model. While not an agent in itself (as far as we know), the idea that scaling test-time compute will lead to reliable agents is taking hold in the literature and in the industry.
OpenAI o3 Breakthrough High Score on ARC-AGI-Pub
ARC Prize blog post on OpenAI's o3 results
https://arcprize.org/blog/oai-o3-pub-breakthrough
Lastly, if you would like an overview of autonomous agents, Chip Huyen once again provides a great overview in her blog.
Agents
Chip Huyen's blog post on agents
https://huyenchip.com/2025/01/07/agents.html
Extra Resources
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- Voyager: An Open-Ended Embodied Agent with Large Language Models
- MemGPT: Towards LLMs as Operating Systems
Conclusion
Whew, this was a long post! The incredible thing, however, is that we’re just scratching the surface of all the cool LLM research happening right now. The number of AI-related papers published keeps climbing rapidly, and with the surge of interest in LLMs over the last few years, it’s only going to keep increasing. There are so many unsolved problems, exciting opportunities, and new research directions to discover - good luck!
If this post helped you in any way or if you’ve caught up to speed and are interested in participating in new LLM research, I’d love to hear from you. Feel free to contact me anytime at sgoel9@berkeley.edu!