Australian National University

Qingqin Fang

#### An improvement architecture superior to the Transformer, proposed by Meta

<figure>
<img src="https://cdn-images-1.medium.com/max/852/0*pFztXZpCGW0GplEE" />
<figcaption>Image Generated by Canvas</figcaption>
</figure>

### Author

· Qingqin Fang (**ORCID:**
[0009--0003--5348--4264](https://orcid.org/0009-0003-5348-4264))

### Introduction

Recently, researchers from Meta and the University of Southern
California have introduced a model called Megalodon. They claim that
this model can expand the context window of language models to handle
millions of tokens without overwhelming your memory.

In previous studies on long texts, the focus was mainly on enhancing
performance through optimizations of the Transformer. However, lab data
from recent experiments show that Megalodon outperforms Transformer
models of similar size when processing large volumes of text, and many
researchers are beginning to view it as the potential successor to the
Transformer.

Transformer architecture is generally limited by its high complexity and
its weak ability to handle very long inputs. While there are simpler
solutions that try to address these issues, they usually don't perform
as well in terms of training speed or accuracy on practical tasks. The
emergence of Megalodon is specifically aimed at addressing the challenge
of handling infinite context, which makes it better suited for tasks
where understanding extensive context is crucial.

This post delves into Megalodon.

### Long Context Window

In long-text situations, "**context window**" indicates the number of
tokens a model can handle at one time. Expanding this window enables LLM
to engage in longer conversations, process a wider range of documents,
and learn more information.

In machine learning models, the **attention** mechanism is a special
structure that enhances the weights of certain parts of the input data
while reducing the weights of other parts. This mechanism automatically
learns and calculates the contribution of input data to the output data,
allowing the model to focus on different parts of the input with varying
levels of importance during task execution. This concept is somewhat
similar to human attention.

The diagram below shows the core parts and steps of the **QKV
attention** mechanism, an acronym for Query, Key, and Value, commonly
used in **self-attention** mechanisms, particularly in Transformer
models. Initially, input data is linearly transformed into three
segments: Query (yellow), Key (pink), and Value (green). Each Query is
then dot-multiplied with all Keys to calculate their similarity scores,
which are normalized using the Softmax function. This process assigns
attention weights to each Key relative to each Query, ultimately
producing the final output based on these weights.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*0CGhyOD6zPwz2Nl3" />
<figcaption>Transformer-Attention Flow (scaled-dot-product). Link: <a
href="https://scholar.harvard.edu/sites/scholar.harvard.edu/files/binxuw/files/mlfs_tutorial_nlp_transformer_ssl_updated.pdf">From
Transformer to LLM: Architecture, Training and Usage</a></figcaption>
</figure>

However, self-attention mechanism compares each element in the input
sequence with all others using Scaled Dot-Product Attention. This means
that if the input size is doubled, the memory and computation time
needed to process the input will quadruple. Expanding the Transformer's
context window comes with significant costs. The challenge with
Transformers lies in their "**quadratic complexity**".

Meanwhile, Transformers also face issues with "**the lack of
position-related inductive bias**", which can impact their performance.
Unlike Recurrent Neural Networks (RNNs) or Convolutional Neural Networks
(CNNs), which naturally process data sequentially or have the ability to
recognize spatial structures, transformers rely on **position
embeddings** to incorporate information about the order of tokens. This
method involves adding positional information to the word embedding
vectors of each token, allowing the same word to have different
representations depending on its position in the sentence.

This artificially creates a position-related bias, but such a bias may
not be as effective as the natural positional sensitivity found in other
models. When dealing with tasks where positional context is crucial, for
example, the two *"I"* are located at different positions in the
sentence "*I think, therefore I am",* it should have the same output
representations, but the two I's don\'t. This can sometimes lead to
inefficiencies or require additional mechanisms to address the issue.

In 2022, the **Moving Average Equipped Gated Attention**(MEGA) model was
introduced as a significant enhancement to the traditional Transformer
architecture, particularly improving its performance in handling long
sequence data. [This
article](https://medium.com/@_init_/how-self-attention-with-relative-position-representations-works-28173b8c245a)
provides a more detailed explanation.

MEGA incorporates the **Exponential Moving Average** (EMA) into the
self-attention mechanism. Moving averages, which is a classic approach
for sequential data modelling, widely used in time series data to smooth
out short-term fluctuations and highlight long-term trends or cycles is
calculated for each element's Query and Key, smoothing data and updating
weights. These Queries and Keys reflect not only the current input
features but also the influence of historical data, enabling more
effective capture and utilization of long-distance contextual
information. EMA facilitates the flow of information across different
time steps, improving the model's understanding of long-range
dependencies within sequences.

Moreover, by dividing the computation of the attention mechanism into
smaller segments, MEGA substantially reduces computational complexity
from O(n²) to **approximately O(n)**. The use of EMA allows the MEGA
model to maintain a more coherent flow of information between local and
overall context, helping to prevent the loss of information during the
processing across blocks. This restructuring makes it possible to handle
longer data sequences more efficiently while preserving the integrity of
the information processed.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*IhNa1lHvSi8YNuXh" />
<figcaption>Mega — model architecture, Source: Xuezhe Ma et al., 2022.
Link: <a
href="https://arxiv.org/pdf/2209.10655">https://arxiv.org/pdf/2209.10655</a></figcaption>
</figure>

### Megalodon

Building on the MEGA architecture, Megalodon has implemented a series of
enhancements and added several new technical components:

Firstly, Megalodon introduces the **Complex Exponential Moving Average
(CEMA)** component, a novel technique that extends the [Multidimensional
Damped Exponential Moving Average](https://arxiv.org/abs/2312.02549)
(EMA) from MEGA into the complex domain. The previous MEGA model was
based on real numbers, and Megalodon extends it to include imaginary
numbers, hence the term "complex domain" here. This enhancement boosts
the model's capability to process complex data, thereby improving
performance and stability.

Secondly, normalization is a crucial step in Transformer models,
involving adjusting data to a distribution with a mean of 0 and a
variance of 1. This reduces variations in data distribution during model
training, thus helping the model learn faster. The commonly used
normalization methods are:

- **Layer Normalization** adjusts the outputs by calculating the mean
  and standard deviation of all neuron outputs within a layer. This
  method effectively stabilizes the hidden states in recurrent networks
  but does not account for batch and sequential data properties,
  resulting in poor management of dependencies between time steps.
- **Group Normalization** normalizes neuron outputs by grouping them.
  This method is not affected by batch size and maintains good stability
  across a wide range of batch sizes. However, its performance heavily
  depends on how the channels are grouped, and it may inadvertently leak
  future information in autoregressive models due to normalization
  across different time steps.

To address these issues, researchers have proposed a new, more complex
method called **Timestep Normalization**. This technique extends group
normalization to autoregressive sequence modelling, allowing for
independent normalization at each time step. This enables the model to
effectively adjust step-by-step in sequence processing, ensuring that
dependencies between time steps do not compromise model performance.
This method helps the model handle sequential data more accurately,
improving processing effectiveness and efficiency.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*V5PbIbNUwDAash4D" />
<figcaption>Normalization methods. The elements in blue or yellow are
the regions to compute means and variances. Source: Xuezhe Ma et al.,
2024. Link: <a
href="https://arxiv.org/pdf/2404.08801">https://arxiv.org/pdf/2404.08801</a></figcaption>
</figure>

Additionally, Megalodon has introduced a **normalized attention
mechanism**. This involves normalizing the queries (Query) and keys
(Key) before calculating the attention scores, which helps stabilize the
scores and addresses issues caused by inconsistent sizing, improving the
model's overall reliability and training stability.

These adjustments are achieved by enhancing common normalization
techniques. By breaking down the input sequence into parts, Megalodon
simplifies the computational process and memory usage during model
training and inference, making it more efficient. This allows Megalodon
to handle large amounts of data more effectively without using excessive
resources.

### Experiment and its Results

Researchers trained a 7-billion-parameter Megalodon model on the same 2
trillion pre-training data as used with the Llama2 model, allowing for a
completely fair comparison with Llama2--7B. The final results are shown
in the diagram below.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*AJmNTzXPJVCQRkK8" />
<figcaption>Processed tokens during training for MEGALODON-7B, LLAMA2–7B
and LLAMA2–13B. Source: Xuezhe Ma et al., 2024. Link: <a
href="https://arxiv.org/pdf/2404.08801">https://arxiv.org/pdf/2404.08801</a></figcaption>
</figure>

It is evident from the direct comparison with Llama 2 that Megalodon not
only trains more efficiently but also surpasses the Transformer in
accuracy when handling tasks with 7 billion parameters and 2 trillion
training tokens. Specifically, Megalodon's training loss was 1.70, which
is between Llama2--7B (1.75) and 13B (1.67).

At the same time, researchers used the same computational resources to
run experiments with LLAMA 2--7B and MEGALODON-7B, comparing their
training speeds at different context lengths.

<figure>
<img src="https://cdn-images-1.medium.com/max/836/0*BGeSHoAz8e3dL6jV" />
<figcaption>Average word/token per second for MEGALODON-7B and
LLAMA2–7B. Source: Xuezhe Ma et al., 2024. Link: <a
href="https://arxiv.org/pdf/2404.08801">https://arxiv.org/pdf/2404.08801</a></figcaption>
</figure>

At a 4K context length, MEGALODON-7B was slightly slower (by about 6%)
than LLAMA 2--7B, due to the introduction of CEMA and timestep
normalization. However, when the context length was expanded to 32K,
MEGALODON-7B was significantly faster by 32% compared to LLAMA 2--7B,
demonstrating MEGALODON's computational efficiency in long-context
pre-training.

For further experimental details, please refer to the [full
paper](https://arxiv.org/pdf/2404.08801). Additionally, [the code for
this project has been made open
source](https://github.com/XuezheMax/megalodon), making it accessible to
a broader audience. The computational cost of running the 7-billion
parameter model is optimized to be feasible on personal computers,
enhancing its accessibility for individual researchers and developers.

### Future Thinking

So, can the Transformer be replaced by these new architectures?

As of now, the Transformer architecture still holds a leading position.
While Meta is exploring new architectures like Megalodon, it is also
actively improving its Transformer models and has recently released the
latest version of its open-source
LLM, [Llama-3](https://ollama.com/library/llama3).

One challenge that competitors to Transformers face is the need for
specialized hardware and software tools. There is already a large
library and ecosystem of tools available for training, fine-tuning, and
customizing Transformer models, suitable for various applications and
hardware configurations. Researchers have also developed optimized
software codes to enhance the performance of Transformer LLMs on
memory-constrained devices. Alternatives still need to keep up with
these advancements.

Meanwhile, other researchers are modifying the Transformer architecture
to reduce its memory and computational demands. For instance, Google
researchers recently published a paper on
[Infini-attention](https://arxiv.org/abs/2404.07143), aimed at providing
Transformers with an infinite context window without increasing memory
and computational complexity.

While new architectures like Megalodon bring exciting improvements,
particularly in handling longer sequences and reducing computational
demands, the traditional Transformer model still holds a dominant
position in the AI field due to its widespread application and
continuous development.

In the future, these new models could work alongside or gradually
integrate with the traditional Transformer framework to more effectively
address specific challenges. Whether these models will completely
replace Transformers, or rather complement them, depends on further
advancements and broader adoption of these innovative approaches in
practical applications.

Looking ahead, observing how these models evolve and are utilized across
various domains might reshape our understanding and implementation of
neural network architectures in large-scale artificial intelligence
systems. The journey from traditional Transformers to models like
Megalodon marks a dynamic and transformative phase in machine learning,
promising to enhance the capabilities and impact of AI technologies
globally.

### **References**

- Xuezhe Ma et al. (2024). Megalodon: Efficient LLM Pretraining and
  Inference with Unlimited Context Length. arXiv:2404.08801 \[cs.LG\].
  DOI: <https://arxiv.org/pdf/2404.08801>
- Xuezhe Ma et al. (2024). Mega: Moving average equipped gated
  attention. arXiv:2209.10655 \[cs.LG\]. DOI:
  <https://doi.org/10.48550/arXiv.2209.10655>

![](https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b4c7322602ba){width="1"
height="1"}


An improvement architecture superior to the Transformer, proposed by Meta  Author   · Qingqin Fang (
<strong>
 ORCID:
</strong>
0009–0003–5348–4264) Introduction   Recently, researchers from Meta and the University of Southern California have introduced a model called Megalodon. They claim that this model can expand the context window of language models to handle millions of tokens without overwhelming your memory.

The longer the context, the better? Unlimited Context Length in Megalodon

#### Solutions to Enhance LLM Performance in Long Contexts

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*XS5GWBmt48Z8OIOR" />
<figcaption>Picture From Unsplash</figcaption>
</figure>

### Author

· Qingqin Fang (**ORCID:**
[0009--0003--5348--4264](https://orcid.org/0009-0003-5348-4264))

### Introduction

In the era of AI breakthroughs, large language models (LLMs) are not
just advancements; they are revolutions, transforming how we interact
with technology, from casual conversations with chatbots to the
intricate mechanisms behind sophisticated data analysis tools. Their
ability to understand and generate human-like text has set a new
standard in artificial intelligence. Yet, amidst the accolades, a
crucial question emerges: How effectively do these AI titans manage and
interpret the vast expanses of text they're fed?

While considerable research has been dedicated to amplifying LLMs'
prowess by expanding their capacity to process extended texts, a gap
remains in our understanding of their performance across even longer
stretches of text. This gap is not just academic --- it's crucial for
the real-world application of these models.

Enter a recent, groundbreaking study that casts a spotlight on an
intriguing limitation known as the "Lost in the Middle" phenomenon. This
phenomenon underscores a significant challenge: LLMs, when tasked with
parsing through lengthy texts for information extraction --- be it in
multi-document question answering or key-value pair
retrieval --- exhibit a stark sensitivity to the placement of crucial
information. Their performance is robust when key details are positioned
at the text's extremities but wanes dramatically when these details are
ensconced in the middle.

This article is set to unravel the layers behind this phenomenon,
probing into the reasons why LLMs exhibit such a marked preference for
information placement and what this reveals about their underlying
mechanisms. More importantly, it aims to chart a path forward --- how
can we refine these digital intellects to navigate and process lengthy
texts with the finesse and depth they require? Through this exploration,
we aim not just to highlight a problem but to kindle a conversation on
potential solutions, setting the stage for the next leap in LLM
evolution.

### "Lost in the Middle" Problem

When LLMs are tasked with processing long texts, especially those
requiring the identification and use of contextually relevant
information, their performance varies significantly based on the
location of the relevant information within the input. The observed
pattern is clear: LLMs perform well when the pertinent details are at
the beginning or end of the input. However, their effectiveness
diminishes when these details are nestled in the middle.

<figure>
<img src="https://cdn-images-1.medium.com/max/890/0*EqldLkq99lpQ1Scb" />
<figcaption>Changing the location of relevant information. Source:
Nelson F. Liu et al., 2023. Article Link: <a
href="https://arxiv.org/abs/2307.03172">https://arxiv.org/abs/2307.03172</a></figcaption>
</figure>

The graph displayed above demonstrates how altering the position of
pertinent information within a language model's input context leads to a
distinctive U-shaped curve in performance. We can observe that the
model's precision is influenced by the location of information, showing
a particular challenge in handling sections of text that are not at the
forefront or the tail end.

### Findings on Experimentation

This behavior was systematically examined across various tasks, such as
document-based question answering and key-value pair retrieval. The
experiments involved prompts with varying lengths and contexts,
positioning the critical information at different points within
the text.

In the **multi-document question-answering task**, the input prompt, as
illustrated, included a relevant document capable of answering the
question among several irrelevant ones. This crucial document was placed
in various positions to test the model's stability.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*3_QTHhr0lvmrKu9E" />
<figcaption>Example of the multi-document question answering task.
Source: Nelson F. Liu et al., 2023. Article Link: <a
href="https://arxiv.org/abs/2307.03172">https://arxiv.org/abs/2307.03172</a></figcaption>
</figure>

The following results show that in scenarios with a total document count
of 10, 20, and 30, corresponding to token counts of approximately 2K,
4K, and 6K, it was observed that better performance is achieved when
relevant documents are located at the beginning or end of the prompt,
while performance decreases when the relevant documents are in
the middle.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*iUm_mdUX2FojLhLM" />
<figcaption>Result of the multi-document question answering task.
Source: Nelson F. Liu et al., 2023. Article Link: <a
href="https://arxiv.org/abs/2307.03172">https://arxiv.org/abs/2307.03172</a></figcaption>
</figure>

The findings indicated that models performed better when the relevant
document was at the beginning or the end of the prompt, with a
noticeable decrease in performance when it was situated in the middle.
Interestingly, models designed to support longer contexts did not show a
significant advantage over their counterparts in this task, indicating a
close performance between models with varying context lengths.

In the **key-value pair retrieval experiment** aimed at assessing the
models' ability to retrieve content from long texts, the key needed for
the query was positioned at different points within the prompt to test
the models' retrieval capabilities.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*cWV-XPqblUyS9iYv" />
<figcaption>Example of the key-value pair retrieval task. Source: Nelson
F. Liu et al., 2023. Article Link: <a
href="https://arxiv.org/abs/2307.03172">https://arxiv.org/abs/2307.03172</a></figcaption>
</figure>

The outcomes suggested that higher-performing models achieved
near-perfect accuracy regardless of the key's placement within the text.
However, less effective models exhibited a similar trend to the first
experiment, where the placement of the key in the middle resulted in
lower performance.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*aMmpqgHJ3sd7HO49" />
<figcaption>Result of the key-value pair retrieval task. Source: Nelson
F. Liu et al., 2023. Article Link: <a
href="https://arxiv.org/abs/2307.03172">https://arxiv.org/abs/2307.03172</a></figcaption>
</figure>

### Analysing the Causes

In response to the "Lost in the Middle" phenomenon, the researchers
delve into potential underlying causes from various angles:

**Relation to LLM Model Architecture**: The experiments above used
decoder-only architectures. The figure below adds the results of two
encoder-decoder models (Flan-T5-XXL and Flan-UL2) in multi-document
question answering. Similar to the findings above, both Flan-UL2 and
Flan-T5-XXL show a decrease in performance when the relevant information
is positioned in the middle of the input context, particularly as the
input context length exceeds 2048 tokens.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*rpfV3atx9VMgYQkv" />
<figcaption>Experiment of encoder-decoder models (Flan-UL2 and
Flan-T5-XXL). Source: Nelson F. Liu et al., 2023. Article Link: <a
href="https://arxiv.org/abs/2307.03172">https://arxiv.org/abs/2307.03172</a></figcaption>
</figure>

Researchers speculate that encoder-decoder models may better leverage
their context window because their bidirectional encoders allow them to
process each document in the context of future documents. This ability
may help improve estimates of the relative importance of documents.

**Effect of Query and Context Information Placement**: A comparison
between a base LLM and its instruction-fine-tuned version reveals that
the "Lost in the Middle" phenomenon is inherent to LLMs, not a byproduct
of fine-tuning processes.

**Influence of Model Size and Instruction Fine-Tuning**: Further
analysis reveals that larger Llama models (13B and 70B) exhibit the
"Lost in the Middle" phenomenon more pronouncedly compared to the
smaller 7B model, indicating that model size and fine-tuning intricately
affect the phenomenon's manifestation.

### Potential Solution

A promising strategy to alleviate the "Lost in the Middle" issue in LLMs
is **Prompt Compression**, as exemplified by the LongLLMLingua approach
proposed by Microsoft. This method enhances LLMs' ability to detect key
information within prompts in long context scenarios, effectively
mitigating the "Lost in the Middle" issue. Particularly beneficial in
Retrieve and Generate scenarios, LongLLMLingua can lead to significant
cost savings, with up to \$28.5 saved per 1,000 samples for
GPT-3.5-Turbo (and potentially tenfold more for others), while
simultaneously boosting the performance of LLMs.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*dqe1SSYphYabS6KT" />
<figcaption>LongLLMLingua Performance with key information position,
Source:Microsoft. Link: <a
href="https://github.com/microsoft/LLMLingua.git">https://github.com/microsoft/LLMLingua.git</a></figcaption>
</figure>

**PARTOR**, developed as an advancement of the **RAPTOR** methodology
for Retrieve and Generate applications, significantly enhances the
capacity of LLMs to process and understand lengthy texts. As the
structure shown, it introduces an innovative methodology that
recursively vectorizes, clusters, and summarises text to build a
hierarchical tree structure that encapsulates various levels of
abstraction.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*f0VQchZFucce3Jyu" />
<figcaption>Schematic diagram of RAPTOR structure, Source: Langchain.
Link: <a
href="https://github.com/langchain-ai/langchain.git">https://github.com/langchain-ai/langchain.git</a></figcaption>
</figure>

RAPTOR enhances the way models interact with extensive documents,
enabling the retrieval of information across different levels of detail
and abstraction. This tree-structured approach allows for more effective
integration of information from lengthy documents, which is particularly
beneficial for tasks that require complex reasoning and understanding.
Experimental results show that this recursive summarisation retrieval
method outperforms traditional retrieval-augmented approaches in various
tasks. Notably, when combined with GPT-4, RAPTOR improves performance by
20% on the QuALITY benchmark test, especially in complex reasoning-based
question-answering tasks.

### Conclusion

In conclusion, the exploration of the "Lost in the Middle" phenomenon in
large language models (LLMs) underscores the critical importance of
understanding how these models process and interpret information within
lengthy texts. Through rigorous experimentation and analysis,
researchers have identified a significant challenge: the diminished
performance of LLMs when tasked with extracting information from the
middle sections of documents.

However, this challenge has not gone unaddressed. Recent advancements,
such as the RAPTOR model and its evolution into PARTOR for Retrieve and
Generate applications, represent promising steps forward. These
methodologies introduce innovative approaches to enhance LLMs' ability
to navigate and comprehend lengthy texts, particularly through recursive
summarisation retrieval methods and hierarchical tree structures.

By effectively integrating information across different levels of
abstraction and improving performance in tasks requiring complex
reasoning, these advancements hold considerable potential for advancing
the capabilities of LLMs. Furthermore, strategies like Prompt
Compression, exemplified by LongLLMLingua, offer additional avenues for
mitigating the "Lost in the Middle" issue and enhancing LLM performance
in long-context scenarios.

In essence, the journey to overcome the challenges posed by lengthy
texts in LLM processing is ongoing. As we continue to unravel the layers
behind these phenomena and explore potential solutions, we pave the way
for the next phase of evolution in language model research and
application. Through collaborative efforts and innovative methodologies,
we aim to refine LLMs' ability to navigate and process extensive texts
with the finesse and depth required for real-world applications,
ultimately shaping the future of artificial intelligence and
human-machine interaction.

### References

- Liu N F, Lin K, Hewitt J, et al. Lost in the middle: How language
  models use long contexts. arXiv:2307.03172
- Wenhan Xiong et al. Effective Long-Context Scaling of Foundation
  Models. arXiv:2309.16039
- Parth Sarthi et al. RAPTOR: Recursive Abstractive Processing for
  Tree-Organized Retrieval. arXiv:2401.18059

![](https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=23d3c48614e9){width="1"
height="1"}


Solutions to Enhance LLM Performance in Long Contexts  Author   · Qingqin Fang (
<strong>
 ORCID:
</strong>
0009–0003–5348–4264) Introduction   In the era of AI breakthroughs, large language models (LLMs) are not just advancements; they are revolutions, transforming how we interact with technology, from casual conversations with chatbots to the intricate mechanisms behind sophisticated data analysis tools.

Rogue Scholar Posts

The longer the context, the better? Unlimited Context Length in Megalodon

Navigating the Long Context Conundrum: Challenges in Language Models’ Information Processing