Australian National University

Vaibhav Khobragade

#### Bridging Human Perception and AI's Future: The Convergence of Visual Understanding and Semantic Networks

<figure>
<img src="https://cdn-images-1.medium.com/max/900/0*CKgpNGbkidSp4EsM" />
<figcaption>Integrating Vision, Language, and Knowledge in AI by ChatGPT
(DALL-E 2).</figcaption>
</figure>

### Author

· Vaibhav Khobragade (**ORCID:**
[0009-0009-8807-5982](https://orcid.org/0009-0009-8807-5982))

### **Introduction**

The fusion of Vision-Language Models (**VLMs**), Generative Models, and
Knowledge Graphs (**KGs**) is reshaping how artificial intelligence (AI)
understands and interacts with the world. One example is automatic image
description for visually impaired users, where AI generates accurate and
detailed descriptions of images to help users understand their content.

VLMs integrate visual and textual information, enabling tasks like image
captioning, while Generative Models create new, diverse content such as
text, video, and images. KGs, with their structured
representation of real-world entities and relationships, enhance these
models by providing deep, contextual insights. This combination unlocks
more accurate, relevant, and contextually rich AI capabilities, from
improved search engines to creative content generation, making
technology more intuitive and closer to human-like understanding and
creativity.

### **Knowledge Graphs**

A knowledge graph (KG) is a structured representation of real-world
entities and their interrelationships, encapsulating complex information
in a graph-structured form where nodes represent entities, and edges
denote relationships. For instance, 'Leonardo DiCaprio' and 'Inception'
are nodes linked by an edge representing his role in the film. For more
details, see
[this article on knowledge graph augmentation for LLMs](https://medium.com/@researchgraph/enhancing-language-models-the-role-of-knowledge-graph-augmentation-in-overcoming-llm-challenges-c232b5e9328f).
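The node-and-edge structure described above can be sketched as a list of (head, relation, tail) triples. This is a minimal illustration, not a production graph store; the relation names (`acted_in`, `directed`, `has_genre`) and the helper functions are assumptions made for this example.

```python
from collections import defaultdict

# A toy knowledge graph stored as (head, relation, tail) triples.
# Relation names are illustrative assumptions, not a standard schema.
triples = [
    ("Leonardo DiCaprio", "acted_in", "Inception"),
    ("Christopher Nolan", "directed", "Inception"),
    ("Inception", "has_genre", "Science Fiction"),
]


def build_index(triples):
    """Index outgoing edges by head entity for quick lookup."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[head].append((relation, tail))
    return index


def neighbours(index, entity, relation=None):
    """Return the tails linked to `entity`, optionally filtered by relation."""
    return [tail for rel, tail in index.get(entity, [])
            if relation is None or rel == relation]


index = build_index(triples)
print(neighbours(index, "Leonardo DiCaprio", "acted_in"))
```

Storing edges as triples keeps the graph easy to extend: adding a fact is a single append, and the same format is what knowledge graph embedding models consume.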

### **Vision-language Models**

Vision-language models (VLMs) combine what they see in pictures with
written words to both understand and create new content. They are good
at captioning pictures and answering questions about what is in an
image. DALL-E is a prime example of VLMs. It showcases their
capabilities by generating images from textual descriptions, combining
the fields of computer vision and natural language processing. For
instance, given the prompt "A futuristic city at sunset", it generates
the following image:

<figure>
<img src="https://cdn-images-1.medium.com/max/847/0*QFKhHuhRExW6wOSw" />
<figcaption>A futuristic city at sunset by DALL-E.</figcaption>
</figure>

### **Generative Models**

Generative models, often part of the broader VLM framework, are capable
of creating new content, such as images or text, by learning the
underlying distribution of training data, enabling applications like
text-to-image synthesis (as in the above instance) and style transfer.

### **How knowledge graphs work with VLMs and generative models**

#### **Tuning Vision-Language Models With Dual Knowledge Graph**

Previous techniques for tuning Vision-Language models, such as
[CLIP-Adapter](https://arxiv.org/abs/2110.04544),
[TaskRes](https://arxiv.org/abs/2211.10277), and
[Tip-Adapter](https://arxiv.org/abs/2111.03930), have two main issues.
First, they adapt models using knowledge from a single modality (either
text or images), which means that they don't fully utilise the
**relationship** between the two modalities.

For example, consider an image of a cat and a dog playing together.
Focusing solely on the image might lead the model to identify individual
objects (cat, dog) but miss the interaction between them. Likewise,
focusing solely on the text description might not capture the visual
details of the scene.

Second, they do not fully leverage structured knowledge about the
relationships between different concepts. In scenarios with limited
data, this can lead to several issues such as suboptimal solutions, bias
towards partial attributes, and inefficient transfer and generalisation.

For example, if the VLM encounters an image of a chef holding a knife,
it might struggle to understand the specific action (chopping) without
additional knowledge about the relationship between chefs, knives, and
food preparation.

The new method of using Dual Knowledge Graphs addresses the limitations
of current VLM tuning methods, particularly under scenarios with limited
data. The core innovation is the **GraphAdapter**, a strategy that
leverages dual KGs --- separate but interconnected graphs for textual
and visual knowledge --- to enrich the model's understanding and
generation capabilities.

By creating two interconnected KGs --- one for text and another for
visual information --- the GraphAdapter enables VLMs to draw on a richer
set of relationships and semantic understandings. This dual-graph
approach allows the model to better capture the nuances of how objects
and concepts are related across visual and textual data, leading to more
accurate and context-aware outputs.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*VTIP54a3ARo94oQe" />
<figcaption>Comparison between (a) Zero-shot CLIP, (b) CLIP-Adapter (c)
TaskRes, and (d) proposed GraphAdapter by (Li et al.,
2023) Article link: <a
href="https://arxiv.org/abs/2309.13625">https://arxiv.org/abs/2309.13625</a></figcaption>
</figure>

In the above image, while the output classifications might look similar
at a glance, the essence of GraphAdapter's innovation is not just in the
accuracy of classification but in how it achieves this result. By
leveraging dual Knowledge Graphs, GraphAdapter can potentially offer
richer contextual understanding and generalisation capabilities,
especially in "low-data regimes" or when faced with nuanced,
fine-grained distinctions between classes. This approach marks a
significant shift from previous methods, aiming to deeply integrate
cross-modal knowledge and structured relationships into the adaptation
process for Vision-Language Models.

A direct comparison of the results across different methods, including
Zero-shot CLIP, CLIP-Adapter, TaskRes, and the proposed GraphAdapter, is
shown in the Performance Comparison section below. It shows how
GraphAdapter consistently outperforms the baseline methods across
different numbers of shots, underscoring the effectiveness of
integrating dual Knowledge Graphs for structured knowledge exploitation.

#### **Working**

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*DyWA5ETiGg7IC72e" />
<figcaption>The pipeline of GraphAdapter, which is composed of the dual
knowledge graph and a CLIP-based (Contrastive Language-Image
Pre-training) image classification pipeline. Source: Li et al., 2023.
Article link: <a
href="https://arxiv.org/abs/2309.13625">https://arxiv.org/abs/2309.13625</a></figcaption>
</figure>

1. **Starting point (Input Images and Texts):** The process starts with
 images and their corresponding texts. These could be things like
 photos of animals along with descriptions or labels.
2. **Transformation (Text and Visual Encoders):** The text descriptions
 are processed by CLIP's Text Encoder, which transforms each piece of
 text into a feature vector that the computer can work with more
 efficiently. Similarly, the Visual Encoder processes the images,
 turning each one into a feature vector of the same dimension.
3. **Mapping relationships (Dual Knowledge Graphs Creation):** *For
 text,* a Textual-Subgraph is created. It's like making a map that
 shows how different words or phrases are related based on their
 meanings. *For visuals*, a visual sub-graph aims to capture and
 model the relationships between different visual elements and
 concepts within the images. This involves understanding what the
 image contains (e.g., objects, scenes, actions) and how these
 elements are related to each other in terms of their visual and
 contextual relationships.
4. **Refining Connections (Graph Convolutional Networks (GCNs)):** These
 are special tools that blend and refine the information in the text
 and visual maps (the subgraphs), making sure the connections and
 relationships are as accurate and helpful as possible. In the figure,
 K and d denote the number of classes and the dimension of the
 textual/visual features, respectively.
5. **Fusion and Adjustment (GraphAdapter):** This is the heart of the
 process. It takes the refined maps of texts and visuals and combines
 them, ensuring that the final output makes sense both visually and
 textually. It's like making sure the description "a big red apple"
 matches with pictures of apples, not bananas.
6. **Final Output:** The final step produces adjusted or enhanced text
 and image features, ensuring that images match their descriptions
 accurately and vice versa. These features can be used to improve
 image recognition, produce more accurate image descriptions, and so
 on.
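The steps above can be sketched in a few lines of plain Python. This is a heavily simplified illustration under stated assumptions, not the authors' implementation: the graph convolution has no learned weights, the edge weights are a softmax over cosine similarities, and the mixing coefficients `alpha` and `beta` as well as all feature values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(normalise(a), normalise(b))


def aggregate(query, nodes):
    """One weight-free graph-convolution step: a similarity-weighted
    average of the node features in a subgraph."""
    weights = [math.exp(cosine(query, node)) for node in nodes]
    total = sum(weights)
    return [sum(w * node[i] for w, node in zip(weights, nodes)) / total
            for i in range(len(query))]


def adapt_text_features(text_feats, visual_protos, alpha=0.7, beta=0.5):
    """Refine each class text feature with knowledge from both subgraphs.
    alpha and beta are hypothetical mixing weights."""
    adapted = []
    for z in text_feats:
        from_text = aggregate(z, text_feats)        # textual subgraph
        from_visual = aggregate(z, visual_protos)   # visual subgraph
        fused = [beta * t + (1 - beta) * v
                 for t, v in zip(from_text, from_visual)]
        mixed = [alpha * a + (1 - alpha) * f for a, f in zip(z, fused)]
        adapted.append(normalise(mixed))
    return adapted


def classify(image_feat, class_feats):
    """Pick the class whose feature is most similar to the image feature."""
    scores = [cosine(image_feat, c) for c in class_feats]
    return scores.index(max(scores))


# Toy data: K = 2 classes, d = 3 feature dimensions.
text_feats = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]]
visual_protos = [[0.9, 0.0, 0.2], [0.1, 0.8, 0.3]]
adapted = adapt_text_features(text_feats, visual_protos)
print(classify([0.8, 0.1, 0.1], adapted))
```

The residual mix keeps most of the original text feature, so the pre-trained knowledge is preserved while the subgraphs contribute a correction.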

#### **Performance Comparison**

The authors evaluate GraphAdapter on few-shot learning tasks using 11
datasets and observe significant improvements over existing methods,
particularly for tasks with limited examples (1- to 16-shot settings).
They also examine how well GraphAdapter generalises to unseen data by
testing it across four diverse datasets. The findings reveal that
GraphAdapter not only excels in adapting to new tasks with few examples
but also demonstrates strong generalisation capabilities, outperforming
other state-of-the-art methods in most scenarios.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*cJPS40vwRLarkUKU" />
</figure>

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*WHXe9ihe-l-KP_JY" />
<figcaption>The performance comparison of GraphAdapter with the
state-of-the-art methods on few-shot learning, including
1-/2-/4-/8-/16-shots on 11 benchmark datasets. Source: Li et al., 2023.
Article link: <a
href="https://arxiv.org/abs/2309.13625">https://arxiv.org/abs/2309.13625</a></figcaption>
</figure>

#### **Knowledge Graph Embeddings (KGEs) can enhance generative models**

Rather than designing new and more complex models, researchers propose a
way to make existing KGEs like [ComplEx](https://arxiv.org/abs/1606.06357),
[CP](https://arxiv.org/abs/1806.07297),
[RESCAL](https://www.researchgate.net/publication/221345089_A_Three-Way_Model_for_Collective_Learning_on_Multi-Relational_Data),
and [TuckER](https://arxiv.org/abs/1901.09590) generate new
relationships between concepts. They achieve this by:

1. Transforming existing KGE models into circuits: The scoring function
 of each model is rewritten as a computational graph of sums and
 products, which makes it possible to add up the scores of many
 candidate triples efficiently.
2. Restricting the calculations within the circuits: The activations
 are either kept non-negative or squared, so that the scores can be
 normalised into valid probabilities (between 0 and 1).

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*zXNiB9141DtgpXcm" />
<figcaption>Knowledge Graph Embedding (KGE) models such as ComplEx, CP,
RESCAL, and TuckER can be adapted into effective generative models for
triples. These models’ scoring functions can be represented as circuits
(highlighted in lilac). To convert these into valid probabilistic
circuits (PCs, shown in orange) that model the probability distribution
over triples, either the activations are restricted to be non-negative
(indicated in blue) or they are squared (marked in red). Source: Loconte
et al., 2023. Article link: <a
href="https://arxiv.org/abs/2305.15944">https://arxiv.org/abs/2305.15944</a></figcaption>
</figure>

#### Working

The approach converts Knowledge Graph Embedding (**KGE**) models such as
ComplEx, CP, RESCAL, and TuckER into generative models by reinterpreting
their scoring functions as **circuits**, or structured computational
graphs. This reinterpretation
allows for efficient processes like marginalisation, which is crucial
for understanding the distribution of certain variables within a larger
set. Marginalisation is a fundamental concept in probability theory and
statistics. When you have a probability distribution over multiple
variables, marginalisation involves summing (or integrating, in the case
of continuous variables) the probabilities of all possible values of the
variables you are not interested in, to obtain the probability
distribution of the variables you are interested in.
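As a small worked example of marginalisation, the sketch below sums a joint distribution over (subject, relation, object) triples down to a distribution over subjects alone, i.e. P(s) is the sum of P(s, r, o) over all r and o. The probabilities are made-up toy numbers, not results from any model.

```python
from collections import defaultdict

# Toy joint distribution over triples. The probabilities are invented
# for illustration and sum to 1.
joint = {
    ("Tom Hanks", "acted_in", "Forrest Gump"): 0.4,
    ("Tom Hanks", "acted_in", "Cast Away"): 0.3,
    ("Robin Wright", "acted_in", "Forrest Gump"): 0.2,
    ("Robin Wright", "acted_in", "Cast Away"): 0.1,
}


def marginal_over_subjects(joint):
    """Sum out the relation and object to get P(subject)."""
    marginal = defaultdict(float)
    for (subject, _relation, _object), probability in joint.items():
        marginal[subject] += probability
    return dict(marginal)


marginal = marginal_over_subjects(joint)
for subject, probability in marginal.items():
    print(f"P({subject}) = {probability:.2f}")
```

Enumerating every triple like this is only feasible for tiny graphs; the appeal of the circuit view is that the same sums can be computed without listing all combinations.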

To make these models generative, their outputs are modified through
**non-negative restriction** or **squaring**, ensuring the outputs can
represent probabilities. These adapted circuits, named Generative KGE
Circuits (**GeKCs**), can then generate new triples for knowledge graphs
by efficiently sampling from their modeled probability distributions.
Moreover, GeKCs are designed to integrate logical constraints directly,
ensuring that all generated or predicted triples are logically
consistent, such as adhering to rules that specify how entities can or
cannot relate to each other. This approach not only retains the models'
link prediction capabilities but also enhances their applicability by
enabling them to handle large graphs efficiently and generate new,
plausible triples that respect predetermined logical constraints.
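To make the non-negative route concrete, here is a minimal sketch using a CP-style scoring function. It assumes random non-negative embeddings in place of trained ones, and it samples by enumerating all triples, which is only practical for a toy graph. All names and sizes are hypothetical. The point of interest is `partition_factorised`, which obtains the normalising constant without enumerating triples.

```python
import itertools
import random

entities = ["Tom Hanks", "Forrest Gump", "Drama"]
relations = ["acted_in", "has_genre"]
DIM = 4                    # hypothetical embedding size
rng = random.Random(0)     # fixed seed so the sketch is repeatable


def nonneg_vector():
    """Non-negative restriction: every activation is >= 0.
    Random values stand in for trained embeddings."""
    return [rng.random() for _ in range(DIM)]


subject_emb = {e: nonneg_vector() for e in entities}
object_emb = {e: nonneg_vector() for e in entities}
relation_emb = {r: nonneg_vector() for r in relations}


def score(s, r, o):
    """CP-style score: a sum over dimensions of three-way products."""
    return sum(a * b * c for a, b, c in
               zip(subject_emb[s], relation_emb[r], object_emb[o]))


def partition_brute_force():
    """Normalising constant by enumerating every possible triple."""
    return sum(score(s, r, o) for s, r, o in
               itertools.product(entities, relations, entities))


def partition_factorised():
    """Same constant, with the sums pushed inside the product, so the
    cost grows with the number of entities rather than their square."""
    total = 0.0
    for i in range(DIM):
        total += (sum(v[i] for v in subject_emb.values())
                  * sum(v[i] for v in relation_emb.values())
                  * sum(v[i] for v in object_emb.values()))
    return total


def probability(s, r, o):
    return score(s, r, o) / partition_factorised()


def sample_triple():
    """Draw a triple in proportion to its score (toy enumeration)."""
    triples = list(itertools.product(entities, relations, entities))
    weights = [score(*t) for t in triples]
    return rng.choices(triples, weights=weights, k=1)[0]


print(probability("Tom Hanks", "acted_in", "Forrest Gump"))
print(sample_triple())
```

Because every score is non-negative, dividing by the partition function always yields a valid probability; with unrestricted embeddings the scores could be negative and this normalisation would not work.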

Consider the following scenario as an example:

**Scenario:** Imagine a knowledge graph containing information about
people, movies, and their genres. We want to use a GeKC Model to predict
missing information.

**Logical Constraint:** One logical constraint could be: "A person
cannot act in a movie that is released before their date of birth."

**Existing Triples:**

- (Tom Hanks, acted_in, Forrest Gump)
- (Tom Hanks, date_of_birth, 1956)

**Prediction Task:** Predict the release date of "Forrest Gump".

**Without Logical Constraints:** The model might simply predict any
release date based on statistical patterns in the data. This could lead
to illogical predictions like "Forrest Gump" being released in 1954,
which would contradict Tom Hanks' date of birth.

**With Logical Constraints:** The model with logical constraints would
consider the "date_of_birth" information and the constraint mentioned
earlier. This would eliminate the possible release date predictions of a
date before 1956 (Tom Hanks' date of birth).
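The effect of the constraint in this scenario can be sketched as filtering and renormalising a predicted distribution. This is a simplification: according to the text above, GeKCs integrate constraints directly into the model, whereas the sketch applies the rule after the fact. The candidate years and their probabilities are invented for illustration.

```python
def apply_birth_constraint(year_probs, birth_year):
    """Drop release years before the actor's birth year, then renormalise
    so the remaining probabilities sum to 1."""
    allowed = {year: p for year, p in year_probs.items() if year >= birth_year}
    total = sum(allowed.values())
    if total == 0:
        raise ValueError("No candidate year satisfies the constraint.")
    return {year: p / total for year, p in allowed.items()}


# Hypothetical unconstrained predictions for the release year.
unconstrained = {1954: 0.2, 1990: 0.3, 1994: 0.5}

constrained = apply_birth_constraint(unconstrained, birth_year=1956)
for year, p in sorted(constrained.items()):
    print(year, round(p, 3))
```

The probability mass that the unconstrained model placed on 1954 is redistributed over the years that remain logically possible.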

#### **Evaluation of the model**

The empirical evaluation demonstrates that GeKCs are competitive with
traditional Knowledge Graph Embeddings (KGEs) for link prediction tasks,
showing that the generative approach does not compromise on accuracy.
Furthermore, incorporating domain constraints directly into the model
significantly enhances predictions by ensuring they adhere to logical
rules, improving reliability and relevance. Lastly, the quality of
triples generated by GeKCs is evaluated through a novel metric,
revealing that GeKCs can efficiently produce high-quality, new triples
that are consistent with the knowledge graph's existing information,
thereby enriching the graph with plausible connections.

#### **Potential Applications**

**Improved Question Answering Systems:** This new method could allow
virtual assistants to not only find existing answers within the
knowledge graph but also generate new, logical relationships between
concepts. This could lead to more comprehensive and informative answers,
even for complex or open-ended questions.

**Drug Discovery and Material Science:** Researchers could use these
enhanced KGEs to explore potential relationships between different
chemicals or biological entities. By generating new, plausible
connections based on existing knowledge, the system could help identify
promising candidates for new drugs or materials with desired properties.

**Recommendation Systems and Market Analysis:** KGEs are already used by
some recommendation systems to understand user preferences and suggest
relevant products or services. This new approach could allow the system
to go beyond simply recommending existing items. It could potentially
generate new product ideas or identify previously unexplored market
connections based on the knowledge graph.

### **Conclusion**

The integration of Knowledge Graph Embeddings (KGEs) with
Vision-Language Models (VLMs) and Generative Models, particularly
through the innovative GraphAdapter and generative KGE circuits (GeKCs),
represents a significant leap forward in AI's ability to understand and
generate complex content. This approach not only enhances the models'
predictive accuracy and contextual understanding but also ensures
logical consistency in generated content. Moreover, it promises
scalability and efficiency, making it a robust solution for enriching
and expanding knowledge graphs in various applications.

### **References**

- Loconte, L., Di Mauro, N., Peharz, R., & Vergari, A. (2023). How to
 turn your knowledge graph embeddings into generative models via
 probabilistic circuits. arXiv. <https://arxiv.org/abs/2305.15944>
- Li, X., Lian, D., Lu, Z., Bai, J., Chen, Z., & Wang, X. (2023).
 GraphAdapter: Tuning vision-language models with dual knowledge graph.
 arXiv. <https://arxiv.org/abs/2309.13625>


Tuning Vision-Language Models and Generative Models with Knowledge Graph

Amanda Kau

**Exploring the potentials and limitations of Vision Language Models**

<figure>
<img src="https://cdn-images-1.medium.com/max/575/0*FPQ2bJ-YSC7MdTTZ" />
<figcaption>Image generated by Google’s Gemini — 11
March 2024.</figcaption>
</figure>

### Author:

- **Amanda Kau (ORCID:**
 [**0009-0004-4949-9284**](https://orcid.org/0009-0004-4949-9284)**)**

The human brain is more extraordinary than any machine we could build.
From an early age, many of us gain the ability to comprehend what our
eyes tell us and articulate it. Furthermore, we combine evidence from
all our senses to reason. Machines, on the other hand, have a long way
to go. Much research has gone into enabling machines to eventually
emulate this human ability through multimodal models, which are capable
of taking in various information formats such as images, text, or audio.

One specific research focus is the creation of vision language models
(**VLMs**), which possess the abilities to 'see' and understand
language. A recent model demonstrating this remarkable ability is GPT-4
with Vision (**GPT-4V**), which allows users to provide an image and
receive responses to questions about the image.

Despite significant developments in VLMs in recent years, they still
possess several shortcomings that will be reviewed in this article. The
structure of VLMs will be briefly introduced with some VLM variants,
followed by example applications in healthcare and helping the visually
impaired. To conclude, this article will review failure cases for VLMs
and future directions.

### Vision Language Models

VLMs are a fusion of computer vision models and large language models
(LLMs), which respectively capture the intricacies of images and
language. While computer vision models excel at processing image data,
they often struggle to understand the meaning behind objects depicted in
images. Conversely, language models exclusively operate on textual data,
so they are adept at navigating the semantics and ambiguities of
language but lack the capability to interpret visual cues. Researchers
aim to combine the advantages of both model types, such that models can
interpret images as well as the textual instructions provided to them
and return well-crafted responses.

Generally, VLMs consist of three components:

- an image encoder,
- a textual encoder, and
- a module to connect image and textual encodings.
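A rough sketch of how the three components listed above might fit together is shown below. Every function here is a hypothetical stand-in: real encoders are large neural networks, and feeding the connector's output to the language model as extra tokens is one common design assumed for this example, not the only one.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Image = List[List[float]]  # toy greyscale image: rows of pixel intensities


def toy_image_encoder(image: Image) -> List[Vector]:
    """Hypothetical encoder: one 2-d feature (mean, max) per image row."""
    return [[sum(row) / len(row), max(row)] for row in image]


def toy_connector(visual_feats: List[Vector]) -> List[Vector]:
    """Hypothetical connector: a fixed linear map from the 2-d visual
    space into the 3-d space used by the text side."""
    weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    return [[sum(w * x for w, x in zip(row, feat)) for row in weights]
            for feat in visual_feats]


def toy_text_encoder(prompt: str) -> List[Vector]:
    """Hypothetical text encoder: one 3-d vector per word."""
    return [[float(len(word)), float(word[0].isupper()), 1.0]
            for word in prompt.split()]


@dataclass
class ToyVLM:
    image_encoder: Callable[[Image], List[Vector]]
    connector: Callable[[List[Vector]], List[Vector]]
    text_encoder: Callable[[str], List[Vector]]

    def build_input(self, image: Image, prompt: str) -> List[Vector]:
        """Project visual features into the text space and place them
        in front of the text tokens as one combined sequence."""
        visual_tokens = self.connector(self.image_encoder(image))
        text_tokens = self.text_encoder(prompt)
        return visual_tokens + text_tokens


vlm = ToyVLM(toy_image_encoder, toy_connector, toy_text_encoder)
sequence = vlm.build_input([[0.0, 1.0], [0.5, 0.5]], "What is shown?")
print(len(sequence), "vectors of dimension", len(sequence[0]))
```

Because each component is passed in as a plain callable, any one of them can be swapped without touching the others, which mirrors the modularity discussed in this article.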

An example is **BLIP-2**, which stands for Bootstrapping Language-Image
Pre-training. BLIP-2 contains all three components in VLMs: an image
encoder, an LLM serving as the textual encoder, and the Q-Former
(Querying Transformer) module to bridge the two aforementioned
components. Here, the image and textual encoders are off-the-shelf,
meaning they were designed and trained separately by others and made
available for use. By bootstrapping from these encoders, that is,
building on their pre-trained abilities while keeping them frozen, the
Q-Former module is trained to extract the visual information that is
most informative of the text and pass it to the LLM, which then
generates the answer. Therefore, the Q-Former is placed between the
image encoder and the LLM in the figure below, like a connecting bridge.
Comprehensive overviews about VLMs in general
can be found on
[HuggingFace](https://huggingface.co/blog/vision_language_pretraining)
and in
this [article](https://medium.com/@letscodeai/introduction-c071c5433e89).

<figure>
<img src="https://cdn-images-1.medium.com/max/832/0*6h2KTyCj7H56ANHm" />
<figcaption>An example (BLIP-2) which contains the three components of
VLMs. Source: Li et al., 2023 Article link: <a
href="https://doi.org/10.48550/ARXIV.2301.12597">https://doi.org/10.48550/ARXIV.2301.12597</a></figcaption>
</figure>

### Modifications to Vision Language Models

The modular format of VLMs allows researchers to interchange individual
components to enhance their models. For instance, among the various
modifications proposed previously are:

- **Using a different image encoder:** In KRISP by Marino et al. (2020),
 information from the given image is encoded into a knowledge graph.
 This data structure connects objects presented in the image with
 relationships to aid the model in understanding semantics. Image data
 is combined with language information from the BERT LLM.

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*uyxPzAjSh3Yd8PZ8" />
<figcaption>Example of KRISP model answering a question given an image.
Source: Marino et al., 2020 <a
href="https://doi.org/10.48550/arXiv.2012.11014">Article link:
https://doi.org/10.48550/arXiv.2012.11014</a></figcaption>
</figure>

- **Employing different connecting modules:** As illustrated in the
 figure above of BLIP-2, the Q-Former module was introduced.
- **Training the components in a novel manner:** Liu et al. (2023)
 proposed [**LLaVA**](https://llava-vl.github.io/), which employs the
 [**CLIP**](https://openai.com/research/clip) (Contrastive
 Language-Image Pre-training) model and
 [**Vicuna**](https://lmsys.org/blog/2023-03-30-vicuna/) to encode
 images and language respectively. To enable their model to perform a
 variety of tasks better, the authors used GPT-4 to generate instruction
 data to train the model to follow instructions more effectively.
- **Adopting a retrieval-augmented approach instead:** In the **REVEAL**
 (Retrieval-Augmented Visual Language) model from Hu et al. (2023), the
 authors opted to encode knowledge from multimodal sources into their
 model, including text and image data. When queried, the model consults
 its knowledge base to enrich its generated response with factual
 information to enhance accuracy.

<figure>
<img src="https://cdn-images-1.medium.com/max/773/0*Fgx2PoPp0yKdl0E1" />
<figcaption>REVEAL retrieves knowledge to craft responses to questions.
Source: Hu et al., 2023 Article link: <a
href="https://doi.org/10.48550/arXiv.2212.05221">https://doi.org/10.48550/arXiv.2212.05221</a></figcaption>
</figure>
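The retrieval-augmented idea can be illustrated with a toy example: look up the most relevant fact in a small knowledge memory and prepend it to the question. This is a generic sketch, not REVEAL's architecture. The memory entries, the hand-made embedding vectors, and the prompt format are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical knowledge memory: (fact, embedding) pairs.
# The vectors are hand-made; a real system would compute them with an encoder.
memory = [
    ("The Eiffel Tower is in Paris.", [0.9, 0.1, 0.0]),
    ("Golden retrievers are a dog breed.", [0.0, 0.9, 0.2]),
    ("Mount Fuji is in Japan.", [0.7, 0.0, 0.6]),
]


def retrieve(query_embedding, memory, top_k=1):
    """Return the top_k facts most similar to the query embedding."""
    ranked = sorted(memory,
                    key=lambda item: cosine(query_embedding, item[1]),
                    reverse=True)
    return [fact for fact, _ in ranked[:top_k]]


def build_prompt(question, query_embedding, memory, top_k=1):
    """Prepend the retrieved facts to the question."""
    facts = retrieve(query_embedding, memory, top_k)
    context = "\n".join(f"Fact: {fact}" for fact in facts)
    return f"{context}\nQuestion: {question}"


# Hypothetical embedding of an image showing a dog.
query = [0.1, 0.8, 0.3]
print(build_prompt("What breed is the dog in the photo?", query, memory))
```

Keeping knowledge in an external memory means facts can be updated without retraining the model that generates the answer.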

### What are Vision Language Models Good For?

VLMs serve as valuable tools to assist humans in various applications
including image captioning, visual summarisation and visual question
answering, among others. These can be extended to applications with
broader impacts, such as:

1. Visual Question Answering in Medical Imagery

<figure>
<img src="https://cdn-images-1.medium.com/max/748/0*ZtA8Tg9VYJKFnGIe" />
<figcaption>Example of VLM being used in medical imagery. Source: Bazi
et al., 2023 Article link: <a
href="https://doi.org/10.3390/bioengineering10030380">https://doi.org/10.3390/bioengineering10030380</a></figcaption>
</figure>

Although VLMs are far from being utilised in the healthcare industry,
they display the potential to support medical practitioners in improving
diagnosis. When presented with an image, VLMs can answer questions about
it, such as in the image above. As machine learning models are trained
on vast amounts of past data, medical professionals could access this
wealth of data that would otherwise be too much for any one person
to handle.

2. Be My Eyes

In March 2023, OpenAI collaborated with Be My Eyes to develop Be My AI,
which incorporates GPT-4V. This technology allows people who are blind
or have low vision to capture pictures using their smartphones and
receive descriptions of the images from Be My AI, thereby allowing them
greater ease in navigating the visual world. Users are currently
cautioned against relying solely on the application, as GPT-4V could
potentially hallucinate when unsure about what it perceives. With
further development, users may eventually be able to use applications
such as this to obtain additional assistance in tasks such as reading
prescriptions or navigating environments.

### Where Vision Language Models Fail

Even state-of-the-art models like GPT-4V can fail in scenarios that are
simple for humans. Some examples are shown in the image below.

<figure>
<img src="https://cdn-images-1.medium.com/max/940/0*mHJcPfU8v7jq7lW7" />
<figcaption>Cases where GPT-4V fails to answer correctly. Source: Tong
et al., 2024 Article link: <a
href="https://doi.org/10.48550/ARXIV.2401.06209">https://doi.org/10.48550/ARXIV.2401.06209</a></figcaption>
</figure>

Despite how remarkable VLMs appear to be, their performance often falls
short of what humans can achieve. Tong et al. (2024) found that most
models even performed worse than a random guessing strategy. The
following results were obtained by testing various models on dual-choice
questions built from CLIP-blind image pairs, that is, pairs of images
that CLIP encodes as similar despite their clear visual differences. For
example, an image of a dog lying down would be given with the question,
"Where is the yellow animal's head lying in this image?" The model would
then choose between two options: "(a) Floor" or "(b) Carpet".

<figure>
<img src="https://cdn-images-1.medium.com/max/637/0*arvH1enogp2peHeu" />
<figcaption>Results of various models on CLIP-blind pairs. Source: Tong
et al., 2024 <a
href="https://doi.org/10.48550/ARXIV.2401.06209">https://doi.org/10.48550/ARXIV.2401.06209</a></figcaption>
</figure>

Moreover, Tong et al. (2024) discovered that the problems were even
greater for models utilising CLIP as the image encoder. This presents a
significant issue given the popularity of CLIP, particularly in cases
like medical imagery mentioned previously. CLIP's popularity stems from
its incorporation of both visual and textual encodings in its
architecture, providing a convenient means of linking these two
modalities.

Another significant finding was that merely increasing the size of the
model did not mean it got better at interpreting visual cues. Rather, it
was more effective to use a purely visual encoder like **DINOv2**
instead of a vision-language one like CLIP, but at the expense of the
model losing some ability to follow instructions.

### Conclusion

Vision language models have undergone rapid improvement and will
continue to do so in the coming years. Their modular architecture
permits various modifications discussed in this article, and they
display significant potential for practical applications like in
healthcare or helping the visually impaired. However, they still cannot
perform at the level that humans do. Further research is necessary to
address their shortcomings before these models can be utilised reliably
without human oversight.

### **References**

- Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping
 Language-Image Pre-training with Frozen Image Encoders and Large
 Language Models (Version 3). arXiv.
 <https://doi.org/10.48550/ARXIV.2301.12597>
- Marino, K., Chen, X., Parikh, D., Gupta, A., & Rohrbach, M. (2020).
 KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain
 Knowledge-Based VQA (Version 1). arXiv.
 <https://doi.org/10.48550/ARXIV.2012.11014>
- Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction
 Tuning (Version 2). arXiv. <https://doi.org/10.48550/ARXIV.2304.08485>
- Hu, Z., Iscen, A., Sun, C., Wang, Z., Chang, K.-W., Sun, Y., Schmid,
 C., Ross, D. A., & Fathi, A. (2022). REVEAL: Retrieval-Augmented
 Visual-Language Pre-Training with Multi-Source Multimodal Knowledge
 Memory (Version 2). arXiv. <https://doi.org/10.48550/ARXIV.2212.05221>
- Bazi, Y., Rahhal, M. M. A., Bashmal, L., & Zuair, M. (2023).
 Vision-Language Model for Visual Question Answering in Medical
 Imagery. In Bioengineering (Vol. 10, Issue 3, p. 380). MDPI AG.
 <https://doi.org/10.3390/bioengineering10030380>
- OpenAI (2023). GPT-4V(ision) System Card.
 <https://openai.com/research/gpt-4v-system-card>
- Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., & Xie, S. (2024). Eyes
 Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
 (Version 1). arXiv. <https://doi.org/10.48550/ARXIV.2401.06209>


How Much Can Vision Language Models Really “See”?