Australian National University

Wenyi Pi

#### Understanding the Evolutionary Journey of LLMs

<figure>
<img
src="https://cdn-images-1.medium.com/max/1000/0*fpp2b149lgG3cmWF" />
<figcaption>Image generated in DALL-E 2.</figcaption>
</figure>

### Author

- Wenyi Pi (**ORCID**:
 [0009-0002-2884-2771](https://orcid.org/0009-0002-2884-2771))

### Introduction

When we talk about large language models (LLMs), we are actually
referring to a type of advanced software that can communicate in a
human-like manner. These models have the amazing ability to understand
complex contexts and generate content that is coherent and has a
human feel.

If you've ever chatted with an AI chatbot or virtual assistant, you
have probably interacted with an LLM, perhaps without even realising it.
These models are used far beyond chatbots, with applications including
text generation, automatic translation, sentiment analysis, and
document summarisation.

LLMs have become an essential part of the artificial intelligence (AI)
landscape. In this article, we will delve into the world of LLMs,
exploring their history and evolution.

### What Is a Large Language Model?

Large Language Models (LLMs) refer to large, general-purpose language
processing models that are first pre-trained on extensive datasets
covering a wide range of topics to learn and master the fundamental
structures and semantics of human language. The term "large" in this
context denotes both the substantial amount of data required for
training and the billions or even trillions of parameters that the model
contains. Pre-training equips the model to handle common language tasks
such as text classification, question answering, and document
summarisation, demonstrating its versatility.

After pre-training, these models are typically fine-tuned for specific
applications on smaller, specialised datasets targeted at particular
domains such as finance or medicine, to enhance accuracy and efficiency
in addressing specific problems. This approach of pre-training followed
by fine-tuning enables LLMs not only to solve a broad range of general
problems but also to adapt to specific application requirements.

### Evolution of Large Language Models

<figure>
<img
src="https://cdn-images-1.medium.com/max/1000/0*B5LPhdBHfCjoGQpk" />
<figcaption>Large language model (LLM) timeline. Source: <a
href="https://www.youtube.com/watch?v=K7o5_Fj7_SY">Brief History of
Large Language Models &amp; Generative AI | Evolution of NLP from Eliza
to ChatGPT</a></figcaption>
</figure>

The image above provides an overview of the timeline for LLMs. We will
discuss each important phase in detail in the following sections.

#### Early Days: chatbots and rule-based systems (1960s)

In 1966, Joseph Weizenbaum at MIT created ELIZA, which is widely
considered the first chatbot. ELIZA was a groundbreaking experiment in
human-computer interaction for its time. It did not understand the
context of a conversation the way humans do, or the way ChatGPT appears
to today, but it could create the illusion of a conversation by
rephrasing user statements as questions using pattern matching and
substitution. Many variations of the chatbot were made. One of the best
known, DOCTOR, was written to respond like a Rogerian psychotherapist,
who "reflects" the patient's statements by turning them back at the
patient as questions. While ELIZA was a humble beginning, it paved the
way for further research into chatbots and natural language processing
in the years to come. To try out ELIZA, use the following
link: [ELIZA](https://web.njit.edu/~ronkowit/eliza.html).
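The pattern matching and substitution idea can be sketched in a few lines of Python. This is only an illustration: the rules, the `REFLECTIONS` table, and the function names below are hypothetical and are not taken from Weizenbaum's original script.

```python
import re

# Hypothetical rules: (pattern, response template). Not ELIZA's original script.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
]

# Swap first-person words for second-person ones so the reply reads naturally.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are"}


def reflect(fragment: str) -> str:
    """Apply the word substitutions to the captured part of the statement."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(statement: str) -> str:
    """Return the first matching rule's question, or a generic prompt."""
    cleaned = statement.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."


print(respond("I am sad about my job."))
# prints: How long have you been sad about your job?
```

Nothing here models meaning. The reply is built entirely from the user's own words, which is why the conversation only appears to be understood.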

#### Rise of Recurrent Neural Networks (1980s)

In the late 20th century, we saw the emergence of neural networks, which
were loosely inspired by the human brain and its interconnected neurons.
Among these, Recurrent Neural Networks (RNNs) emerged in 1986 and
quickly gained popularity. Unlike traditional feedforward neural
networks, where information flows in one direction only, RNNs have a
feedback loop that lets them keep previous inputs in an internal state,
or memory, and use that context when producing an output. They are
trained to convert a sequential input into a sequential output, which
makes them suitable for natural language processing (NLP) tasks. While
RNNs were a significant step forward, they had limitations, especially
with long sentences. Put simply, they are poor at retaining information
over many steps and tend to forget the early parts of a long sequence.
In technical terms, RNNs suffer from the vanishing gradient problem. For
a general description of RNNs, you can visit the following
link: [RNN](https://medium.com/@researchgraph/an-introduction-to-recurrent-neural-networks-rnns-802fcfee3098).
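To make the vanishing gradient problem concrete, here is a minimal sketch of a one-unit RNN. The weights `w_hidden` and `w_input` and the function names are assumptions chosen for illustration. The gradient of the last hidden state with respect to the first is a product of one factor per time step, and each factor here is smaller than 0.5 in size, so the product shrinks rapidly as the sequence grows.

```python
import math


def rnn_forward(inputs, w_hidden=0.5, w_input=1.0):
    """Run a one-unit RNN: h_t = tanh(w_hidden * h_{t-1} + w_input * x_t)."""
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = math.tanh(w_hidden * hidden + w_input * x)
        states.append(hidden)
    return states


def gradient_wrt_first_state(states, w_hidden=0.5):
    """Chain-rule product d h_T / d h_1 for the scalar RNN above."""
    grad = 1.0
    for hidden in states[1:]:
        # d h_t / d h_{t-1} = w_hidden * (1 - h_t^2)
        grad *= w_hidden * (1.0 - hidden ** 2)
    return grad


short_states = rnn_forward([1.0] * 5)
long_states = rnn_forward([1.0] * 50)
short_grad = gradient_wrt_first_state(short_states)
long_grad = gradient_wrt_first_state(long_states)
print(short_grad, long_grad)  # the second value is many orders of magnitude smaller
```

With a gradient that small, the first words of a long sentence have almost no influence on how the weights are updated, which is the "memory loss" described above.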

#### Rise of Long Short-Term Memory (1990s)

Long Short-Term Memory (LSTM) was introduced in 1997 as a specialised
type of RNN. Its primary advantage is the ability to remember
information over long sequences, which overcomes the short-term memory
limitations of plain RNNs. The LSTM has a distinctive architecture with
an input gate, a forget gate, and an output gate. These gates determine
how much information should be memorised, discarded, or output at each
step. This selective ability to memorise or forget helps LSTMs keep
relevant information in their memory, making them better at capturing
long-term dependencies in sentences. For example, an LSTM is better
able than a plain RNN to resolve coreferences, such as linking a
pronoun to a noun mentioned many words earlier.
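The three gates can be written compactly as equations. The formulation below is the commonly used one. The symbols are assumptions introduced here: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ element-wise multiplication, and $W$ and $b$ learned weights and biases.

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The cell state $c_t$ is updated additively, so when $f_t$ stays close to 1 information can pass through many steps largely unchanged.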

#### Gated Recurrent Units (2010s)

Gated Recurrent Units (GRUs) were introduced in 2014. They were
designed to solve some of the same problems as LSTMs, but with a
simpler and more streamlined structure. Just like LSTMs, GRUs were
designed to combat the vanishing gradient problem, allowing them to
retain long-term dependencies in sentences. GRUs simplify the gating by
using only two gates: an update gate, which determines how much of the
previous information to keep versus how much of the new information to
consider; and a reset gate, which determines how much of the previous
information to forget. The reduced gating makes GRUs more efficient in
terms of computation.
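As a rough sketch, one GRU step for a single hidden unit might look like the following. The scalar weights in `params` are hypothetical placeholders, biases are omitted for brevity, and the final blend follows one common convention; some references swap the roles of `z` and `1 - z`.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gru_step(x, h_prev, params):
    """One step of a single-unit GRU; `params` holds hypothetical scalar weights."""
    # Update gate: how much of the state is replaced by the new candidate.
    z = sigmoid(params["w_z"] * x + params["u_z"] * h_prev)
    # Reset gate: how much of the previous state feeds into the candidate.
    r = sigmoid(params["w_r"] * x + params["u_r"] * h_prev)
    candidate = math.tanh(params["w_h"] * x + params["u_h"] * (r * h_prev))
    # Blend the old state with the candidate.
    return (1.0 - z) * h_prev + z * candidate


params = {"w_z": 0.5, "u_z": 0.5, "w_r": 0.5, "u_r": 0.5, "w_h": 1.0, "u_h": 1.0}
h = 0.0
for x in [1.0, 0.5, -1.0]:
    h = gru_step(x, h, params)
print(h)
```

Compared with an LSTM, there is no separate cell state and one fewer gate, which is where the computational saving comes from.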

#### Rise of Attention Mechanism (2014)

As it turned out, RNNs and their variants, LSTMs and GRUs, were still
not great at retaining context that was far away in the sequence. NLP
problems needed something more, and that gave birth to the concept of
attention. The introduction of the attention mechanism marked a
significant paradigm shift in sequence modelling, offering a fresh
perspective compared with previous architectures. An RNN-based
encoder-decoder tries to cram all the information of a source sentence,
regardless of its length, into a single fixed-length context vector, and
because of this its performance deteriorates as the sentence length
increases. In contrast, attention allows the model to look back at the
entire source sequence dynamically, selecting different parts based on
their relevance at each step of the output. This greatly reduces the
risk that crucial information is lost or diluted, especially in longer
sequences.
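In equation form, the idea can be sketched as follows. The symbols are assumptions introduced for illustration: $h_i$ is the encoder state for source position $i$, $s_{t-1}$ is the previous decoder state, and $\operatorname{score}$ is any function that rates how relevant $h_i$ is to the current output step.

$$
e_{t,i} = \operatorname{score}(s_{t-1}, h_i), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}, \qquad
c_t = \sum_{i} \alpha_{t,i} \, h_i
$$

Instead of one fixed vector for the whole sentence, the decoder receives a fresh context vector $c_t$ at every output step, weighted towards the most relevant source positions.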

<figure>
<img
src="https://cdn-images-1.medium.com/max/1000/0*0AsaDcZVgAlp2Tac" />
<figcaption>Performance Comparison. Source: <a
href="https://www.youtube.com/watch?v=K7o5_Fj7_SY">Brief History of
Large Language Models &amp; Generative AI | Evolution of NLP from Eliza
to ChatGPT</a></figcaption>
</figure>

The figure above illustrates how the performance of an RNN declines,
compared with that of an attention model, as the length of the input
sentence increases.

#### The Invention of the Transformer Architecture (2017)

Transformers came out in 2017 with the paper "*Attention is all you
need*" by Vaswani and colleagues at Google. This new type of
architecture relies on an attention mechanism to process sequences. At
its core, it is composed of an encoder and a decoder, each with multiple
stacked layers of self-attention and feed-forward neural networks. A
standout feature is "multi-head" attention, which allows the model to
focus on different parts of the input sentence simultaneously, capturing
various contextual nuances. Another strength is its ability to process
sequences in parallel rather than sequentially. These advantages
enabled transformers to lay the foundation for subsequent models like
BERT, GPT and more, driving us into a new era of LLMs.

#### Emergence of Large Language Models (2018-onwards)

With the success of transformers, the next logical step was scaling.
This was kick-started by Google's BERT model, released in 2018. Unlike
previous models that processed text either left-to-right or
right-to-left, BERT was designed to consider both directions
simultaneously, hence the name: Bidirectional Encoder Representations
from Transformers (BERT). Pre-trained on vast amounts of text, BERT was
one of the first foundational language models that could be fine-tuned
for specific tasks, setting new performance standards across various
benchmarks. OpenAI released its GPT-2 model and Google released its T5
model in 2019, followed by GPT-3 in 2020 and many more models since.
These LLMs could perform a wide variety of tasks, marking a paradigm
shift in AI capabilities.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1000/0*BRJhuhXLfuzhG3bb" />
<figcaption>Timeline of large language models in recent years. Source: <a
href="https://www.nextbigfuture.com/2023/04/timeline-of-open-and-proprietary-large-language-models.html">https://www.nextbigfuture.com/2023/04/timeline-of-open-and-proprietary-large-language-models.html</a></figcaption>
</figure>

### Conclusion

The evolution of language models from simple rule-based systems to
complex neural models shows significant advances in AI technology.
Today, large language models (LLMs) are more than just tools for
enhancing text-based applications: they are increasingly capable of
understanding and communicating with humans.

These language models are also able to handle not just text but also
images and sounds, in which case they are known as multimodal LLMs.
Capable of processing and generating data in multiple modes, these
models integrate text, images, audio and video to understand and analyse
different forms of data. Multimodal LLMs have a range of applications,
including extracting text from digital images, understanding complex
signs, deciphering ancient handwriting, and analysing speech files for
summarisation and transcription.

By simplifying complex content, multimodal LLMs are transforming the way
we interact with technology and could make it more accessible and
responsive to human needs. In short, these LLMs are becoming powerful
partners for humans, helping us tackle a wide range of tasks and
simplifying our lives in many ways.

### References

- Brief History of Large Language Models & Generative AI \| Evolution of
 NLP from Eliza to ChatGPT (no date) YouTube. Available at:
 <https://www.youtube.com/watch?v=K7o5_Fj7_SY>.
- Ibrahim, M. (2023) An Overview of Large Language Models (LLMs), W&B.
 Available at:
 <https://wandb.ai/mostafaibrahim17/ml-articles/reports/An-Overview-of-Large-Language-Models-LLMs---VmlldzozODA3MzQz?galleryTag=llm#what-is-a-large-language-model>.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L.,
 Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017) Attention Is All
 You Need (Version 7). arXiv. Available at:
 <https://doi.org/10.48550/ARXIV.1706.03762>.



Brief Introduction to the History of Large Language Models (LLMs)

Dhruv Gupta

#### Attention mechanism not getting enough attention

<figure>
<img
src="https://cdn-images-1.medium.com/max/1000/0*rNlIXsxZSIs6p63O" />
<figcaption>Image from Unsplash.</figcaption>
</figure>

### Author

- Dhruv Gupta (**ORCID**:
 [0009-0004-7109-5403](https://orcid.org/0009-0004-7109-5403))

### Introduction

As discussed in this
[article](https://medium.com/@researchgraph/rnns-vs-grus-vs-lstms-d69fd3b3f455),
RNNs struggle to learn long-term dependencies. LSTMs and GRUs were
introduced to solve this issue. However, even though they did a fairly
decent job on textual data, they still did not perform well on long
sequences. Transformer-based models, which first came out in 2017, took
the Natural Language Processing (NLP) world by storm. They were
initially introduced to solve the problem of sequence-to-sequence
translation, but they have since become the backbone of almost all
generative AI models. Models like GPT-3 and BERT are large
transformer-based models trained on huge amounts of data. In this
article, we will discuss the architecture of transformer-based models
and how the attention mechanism works.

### Transformers

Transformer-based models brought together two concepts that changed the
field of NLP: attention and the encoder-decoder architecture. Attention
allows a language model to focus on only the important parts of the
text, which gives it the ability to capture long-term dependencies.
Combined with the encoder-decoder architecture of the transformer, the
attention mechanism helps the model pick up subtle nuances in the text
and interpret it in a way that is closer to how humans do.

### What is the Attention mechanism?

The attention mechanism allows a machine learning model to emphasise
certain aspects of the input data, and it forms the heart of a
transformer model. More specifically, the transformer uses a set of
self-attention blocks. Self-attention relates different positions of a
single sequence to one another in order to compute a representation of
that sequence. So the important question is: how does this attention
mechanism work?

#### **Keys, Values, and Queries?**

Keys, values, and queries are the heart of the self-attention mechanism.
They are used to calculate the self-attention weights from the input
**X**. The input **X** is multiplied by weight matrices that are learnt
during training.

- Query Vector **(Q) = X Wq**. It can be thought of as the current word.
- Key Vector **(K) = X Wk**. It acts as an index against which queries
 are matched to find the relevant value vectors.
- Value Vector **(V) = X Wv**. It can be considered as the information
 stored in the input word.

For every query (**Q**), self-attention measures its similarity to each
key (**K**) using the dot product between **Q** and **K**. These scores
are scaled, normalised with a softmax, and then used to weight the value
vectors (**V**), so the output for each word is a weighted combination
of the information stored in the input words.

<figure>
<img src="https://cdn-images-1.medium.com/max/628/0*viA1Ma9jV8Jen1IW" />
<figcaption>Attention equation being used in transformers. Source:
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A.N., Kaiser, L. and Polosukhin, I. (2017). Attention Is All You
Need. Link:<a
href="https://doi.org/10.48550/arXiv.1706.03762">https://doi.org/10.48550/arXiv.1706.03762</a></figcaption>
</figure>
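The query, key, and value computation can be sketched in plain Python as below. It is a minimal, single-head illustration of softmax(Q Kᵀ / √d_k) V from Vaswani et al. (2017). The helper names and the tiny identity weight matrices are assumptions chosen so the numbers are easy to follow; a real model learns these weights and uses an array library instead of nested lists.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention for one sequence."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)  # similarity of every query to every key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights  # weighted sum of value vectors


identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learnt Wq, Wk, Wv
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy word embeddings
output, weights = self_attention(tokens, identity, identity, identity)
for row in weights:
    print([round(value, 3) for value in row])
```

Each row of `weights` sums to 1 and shows how strongly one word attends to every word in the sequence, including itself.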

### Architecture

The transformer uses an encoder-decoder architecture. The idea behind
such an architecture is that the encoder processes the input data and
transforms it into a different representation, which the decoder then
decodes to produce the desired output. Transformers, which were
originally designed for translation, use this architecture a bit
differently. The encoder is given the input sentence, and the decoder is
given the same sentence in the target language. However, the decoder
only sees the words that have already been translated. For example, if a
sentence has five words and three have been translated, the decoder
receives those three translated words together with the encoder's
representation of the original sentence and tries to predict the fourth
word.

<figure>
<img src="https://cdn-images-1.medium.com/max/693/0*HyLiYiY_6k4XOy09" />
<figcaption>Encoder Decoder cell in transformer</figcaption>
</figure>

#### **Diving into the Architecture**

As discussed in the above section the transformer model consists of an
encoder-decoder model.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1000/0*KuZhQSjOQVkOx27s" />
<figcaption>Complete Architecture of transformer block. Source: Vaswani,
A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L. and Polosukhin, I. (2017). Attention Is All You
Need. Link:<a
href="https://doi.org/10.48550/arXiv.1706.03762">https://doi.org/10.48550/arXiv.1706.03762</a></figcaption>
</figure>

#### Encoder

The input data is first converted into word embeddings, which are then
added to the positional embeddings. The positional embeddings play an
important role because the word embeddings themselves lack positional
information, and word order is crucial in textual data. Each encoder
layer has two sub-layers: multi-head self-attention and a feed-forward
neural network. The multi-head self-attention sub-layer consists of
several self-attention mechanisms applied in parallel. Its output is
then passed to the feed-forward neural network, which learns from the
output of the attention sub-layer.
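For illustration, the sinusoidal positional encoding described in Vaswani et al. (2017) can be sketched as follows. In that paper the position vectors are summed element-wise with the word embeddings. The function names are assumptions made here, and many later models learn their positional embeddings instead of computing them.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Element-wise sum of word embeddings and their position vectors."""
    encoding = positional_encoding(len(embeddings), len(embeddings[0]))
    return [
        [e + p for e, p in zip(emb_row, pos_row)]
        for emb_row, pos_row in zip(embeddings, encoding)
    ]


print(positional_encoding(3, 4)[0])  # position 0 -> [0.0, 1.0, 0.0, 1.0]
```

Because every position gets a distinct vector, two identical words at different positions reach the attention layers with different representations.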

#### Decoder

Each decoder layer has sub-layers similar to those of the encoder. A
masked multi-head attention block works on the output vectors of the
previous iterations. A second multi-head attention block works on the
output of the encoder together with the output of the masked multi-head
attention block. The output of this attention block is then passed to a
fully connected feed-forward neural network, and the result is used to
produce the output probabilities for the next word.

The output of the decoder is fed into the decoder again, just like in a
typical RNN model. The masked multi-head attention block masks the
future words so that the decoder generates each output using only the
previously generated outputs and the input representation from the
encoder.
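A minimal sketch of the masking step is shown below. The function names are hypothetical and the attention scores are all zeros purely to make the effect of the mask visible. In practice the same effect is usually obtained by setting masked scores to a very large negative number before the softmax.

```python
import math


def causal_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]


def masked_softmax(scores, mask):
    """Row-wise softmax that gives masked (future) positions zero weight."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        visible = [s for s, keep in zip(score_row, mask_row) if keep]
        peak = max(visible)  # subtract the maximum for numerical stability
        exps = [
            math.exp(s - peak) if keep else 0.0
            for s, keep in zip(score_row, mask_row)
        ]
        total = sum(exps)
        weights.append([value / total for value in exps])
    return weights


scores = [[0.0] * 3 for _ in range(3)]  # toy scores for a three-word output
masked_weights = masked_softmax(scores, causal_mask(3))
for row in masked_weights:
    print(row)
```

The first word can only attend to itself, the second to the first two words, and so on, so no position ever sees words that have not been generated yet.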

### Are they still worthy?

Transformer models, which first came out in 2017 for language
translation, have changed the face of NLP. While NLP tasks are where
transformers are at their peak, they also have many applications beyond
plain text processing. They can be used for tasks such as speech
recognition and image captioning, as well as text tasks such as
classification.

Apart from solving NLP-related tasks, transformer models also form the
backbone of the new-age generative AI models. Models such as GPT, GPT-3,
and BERT are built from stacks of transformer blocks because of their
excellent feature extraction capability. Newer approaches such as
retrieval-augmented generation (RAG) also have transformer-based
encoder-decoder models working in the backend.

Big companies such as Google, Facebook, Vault, and Grammarly, which are
heavily focused on NLP-based applications, still use transformer-based
models in the backend. Their ability to accurately recognise patterns
and context in data makes them an invaluable tool in the field of NLP.

In conclusion, with continued advancements in the field of LLMs,
transformers still have a big role to play. With a vast array of
applications and even more to follow, transformers remain well suited to
solving most complex NLP-based tasks.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1000/0*lAJNOh-EtkhBAAeQ" />
<figcaption>Source: <a
href="https://en.meming.world/wiki/I%27m_Still_Worthy">https://en.meming.world/wiki/I%27m_Still_Worthy</a></figcaption>
</figure>

### **References**

- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
 A.N., Kaiser, L. and Polosukhin, I. (2017). *Attention Is All You
 Need*. \[online\] arXiv.org.
 doi:https://doi.org/10.48550/arXiv.1706.03762.
- Kulshrestha, R. (2020). *Transformers*. Medium. Available at:
 <https://towardsdatascience.com/transformers-89034557de14>.
- huggingface.co *How do Transformers work? - Hugging Face NLP
 Course*. \[online\] Available at:
 <https://huggingface.co/learn/nlp-course/en/chapter1/4>.
- Kulshrestha, R. (2020b). *Understanding Attention In Deep Learning*.
 Medium. Available at:
 <https://towardsdatascience.com/attaining-attention-in-deep-learning-a712f93bdb1e>.


Rogue Scholar Posts

Brief Introduction to the History of Large Language Models (LLMs)

Transformers Models in NLP