Rogue Scholar Posts

Published in Stories by Research Graph on Medium
Author Amanda Kau (ORCID: 0009-0004-4949-9284)

Improving the performance and application of Large Language Models

Large language models (LLMs) like GPT-4, the engine behind products like ChatGPT, have taken centre stage in recent years due to their astonishing capabilities. Yet they are far from perfect.

Published in Stories by Research Graph on Medium
Author Aland Astudillo (ORCID: 0009-0008-8672-3168)

The AI Helper Turning Mountains of Data into Bite-Sized Instructions

LLMs have been changing the way the entire world deals with problems and day-to-day tasks. Making them better for specific applications requires huge amounts of data and complex, expensive training approaches.

Published in The Ideophone
Author Mark Dingemanse

There is a minor industry in speech science and NLP devoted to detecting and removing disfluencies. In some of our recent work we’re showing that treating talk as sanitised text can adversely impact voice user interfaces. However, this is still a minority position. Googlers Dan Walker and Dan Liebling represent the mainstream view well in this blog post: […] Fair enough, you might say.

Published in The Ideophone
Author Mark Dingemanse

It’s easy to forget amidst a rising tide of synthetic text, but language is not actually about strings of words, and language scientists would do well not to chain themselves to models that presume so. For apt and timely commentary we turn to Bronislaw Malinowski, who wrote: […] In follow-up work, Malinowski has critiqued the unexamined use of decontextualised strings of words as a proxy for Meaning: […] Malinowski did not write this on his substack […]

Published in rOpenSci - open tools for open science
Author Amanda Dobbyn

library(tidyverse)
library(monkeylearn)

This is a story (mostly) about how I started contributing to the rOpenSci package monkeylearn. I can’t promise any life flipturning upside down, but there will be a small discussion about git best practices, which is almost as good 🤓. The tl;dr here is nothing novel, but it is something I wish I’d experienced firsthand sooner.