
Understanding Large Language Models


MODULE 5: Applications and Future Directions

5.1 What are Large Language Models (LLMs)?


A large language model (LLM) is a type of artificial intelligence model that uses a neural
network with a very large number of parameters to process, understand, and generate human
language, and that is trained on large text corpora using self-supervised learning. Text
generation, machine translation, summarization, image generation from text (in combination
with image models), code generation, and chatbots or conversational AI are typical
applications of LLMs.

Popular Large Language Models


Here are some well-known LLMs that have been developed and are available for inference.
 GPT-3: Developed by OpenAI, GPT-3 stands for Generative Pre-trained Transformer 3.
Models derived from it (the GPT-3.5 series) powered the initial release of ChatGPT, and the
family is widely recognized for its ability to generate human-like text across a variety of
applications.
 BERT: Created by Google, BERT is commonly used for natural language understanding
tasks and for producing text embeddings, which can in turn be used as features for training
other models.
 RoBERTa: Short for Robustly Optimized BERT Pretraining Approach, RoBERTa was
developed by Facebook AI Research. It keeps the BERT architecture but improves
performance through a better-tuned pre-training procedure.
 BLOOM: An open-access multilingual LLM designed collaboratively by multiple
organizations and researchers. It follows an architecture similar to GPT-3, enabling
diverse language-based tasks.
For implementation, open models such as BERT, RoBERTa, and BLOOM are available on the
Hugging Face Hub, while GPT models are accessed through the OpenAI API; both can be used
from Python-based applications.

Large Language Models Use Cases


 Code Generation: LLMs can generate working code from user instructions for
specific tasks, although the output should still be reviewed and tested.
 Debugging and Documentation: They assist in identifying code errors, suggesting
fixes, and even automating project documentation.
 Question Answering: Users can ask both casual and complex questions and receive
detailed, context-aware responses.
 Language Translation and Correction: LLMs can translate text between dozens of
languages and correct grammatical errors.
 Prompt-Based Versatility: Because LLMs can perform tasks in zero-shot settings (an
instruction only) and one-shot settings (an instruction plus a single example), a
well-crafted prompt can adapt them to many tasks without retraining.
The use cases of LLMs are not limited to those listed above. This flexibility is why prompt
engineering has become an active topic for people who want to use ChatGPT-style models
extensively.

Applications of Large Language Models


LLMs, such as GPT-3, have a wide range of applications across various domains. A few of
them are:
 Natural Language Understanding (NLU):
o Large language models power advanced chatbots capable of engaging in
natural conversations.
o They can be used to create intelligent virtual assistants for tasks like
scheduling, reminders, and information retrieval.
 Content Generation:
o Creating human-like text for various purposes, including content creation,
creative writing, and storytelling.
o Writing code snippets based on natural language descriptions or commands.
 Language Translation: Large language models can aid in translating text between
different languages with improved accuracy and fluency.
 Text Summarization: Generating concise summaries of longer texts or articles.
 Sentiment Analysis: Analyzing and understanding sentiments expressed in social
media posts, reviews, and comments.

5.2 Current Approaches to Large Language Models (LLMs)

1. Transformer Architecture
Most LLMs (e.g., GPT, BERT) are based on the transformer architecture introduced in 2017.
Self-attention mechanisms allow for parallel processing and contextual understanding.
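
As a rough illustration of the self-attention mechanism mentioned above, the following is a minimal sketch of scaled dot-product attention for a single head in plain Python. The function names and the toy input are assumptions made for this example; real transformers also apply learned projection matrices to obtain queries, keys, and values, which are omitted here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    queries and keys are lists of vectors of dimension d_k; values is a
    list of vectors. Returns one output vector per query.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy input: three token vectors used as queries, keys, and values alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Each query is processed independently of the others, which is why all positions of a sequence can be computed in parallel during training.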

2. Pre-training and Fine-tuning
- Pre-training: Models are trained on massive corpora using self-supervised learning (e.g.,
next-word prediction, masked language modelling).
- Fine-tuning: Models are further trained on task-specific data to improve performance (e.g.,
summarization, Q&A).

3. Instruction Tuning
- Models are trained on prompts and instructions to follow human-like directives (e.g.,
FLAN, InstructGPT).
- Enhances usability in conversational AI and task completion.

4. Reinforcement Learning from Human Feedback (RLHF)


- Human preferences guide model output using reinforcement learning.
- Used in OpenAI's ChatGPT to improve alignment and reduce harmful responses.

5. Retrieval-Augmented Generation (RAG)


- Combines LLMs with external search or databases.
- Improves factual accuracy and scalability for domain-specific tasks.
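
The retrieval step of RAG can be sketched as follows. This is only an illustrative toy, assuming a bag-of-words cosine similarity as the retriever and a hypothetical prompt template; production systems typically use learned embeddings and a vector database, and the final prompt would be sent to an LLM, which is not shown.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts, ignoring punctuation."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Place the retrieved text in front of the question (hypothetical template)."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical knowledge base for illustration only.
docs = [
    "The warranty period for the X100 router is two years.",
    "Our office is closed on public holidays.",
]
prompt = build_prompt("How long is the warranty for the X100 router?", docs)
```

Because the answer is grounded in retrieved text rather than in the model's parameters, the knowledge base can be updated without retraining the model.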

6. Multimodal Models
- Trained on combinations of text, images, audio, and video for richer understanding (e.g.,
GPT-4, Gemini, Claude).
- Enables applications in vision-language tasks, such as image captioning or visual Q&A.

7. Model Compression and Efficiency


- Techniques include quantization, distillation, and pruning.
- Makes models deployable on edge devices or with reduced cost.
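
To make the idea of quantization concrete, here is a minimal sketch of symmetric 8-bit quantization of a small list of weights. The scheme (one scale per tensor, integer range [-127, 127]) and the sample weights are assumptions chosen for illustration; real toolkits offer many variants.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights for illustration only.
weights = [0.02, -1.27, 0.635, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing one byte per weight instead of two or four reduces memory use, at the price of a rounding error of at most half the scale per weight.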
8. Open vs. Proprietary Models
- Open-source models (e.g., LLaMA, Mistral) allow community-driven innovation.
- Proprietary models (e.g., GPT-4, Gemini) often lead in performance but are closed-source.

9. Alignment and Safety Research


- Focus on reducing bias, toxicity, and harmful outputs.
- Includes adversarial testing, red-teaming, and interpretability research.

5.3 Challenges of LLMs

Overview of LLM Challenges: The challenges fall into three groups. Design challenges relate
to decisions taken before deployment. Behavioural challenges occur during deployment.
Science challenges hinder academic progress.

5.3.1 Unfathomable Datasets:


Scaling the amount of pre-training data has been one of the major drivers to equip LLMs with
general-purpose capabilities. The size of pre-training datasets quickly outgrew the number of
documents most human teams could manually quality-check. Instead, most data collection
procedures rely on heuristics regarding data sources and filtering. These heuristics have
adverse consequences, and many model practitioners possess only a nebulous understanding
of the data on which their model has been trained. The main issues are as follows.
i) Near-Duplicates (NearDup): These can arise in different forms and have been reported to
degrade model performance. Near-duplicates are harder to find than exact duplicates, whose
removal is a standard step in most data collection pipelines. For example, one study found
that the C4 dataset contains a 61-word sequence repeated 61,036 times in the training split;
by deduplicating it, the authors reduced the rate of emitted memorizations by 10x. SemDeDup
is a technique designed to identify semantic duplicates that, although perceptually distinct,
convey predominantly similar information, such as sentences with analogous structures in
which certain words are replaced by synonyms.
ii) Benchmark Data Contamination: This occurs when the training dataset contains data from,
or similar to, the evaluation test set. It can lead to inflated performance metrics, as the
model can memorize the test data and simply regurgitate it during testing.
iii) Personally Identifiable Information (PII): Phone numbers, email addresses, and similar
data have been found within pre-training corpora, resulting in privacy leaks during prompting.
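
One simple way to flag the near-duplicates described in item (i) is to compare documents by the overlap of their word n-grams (shingles). The sketch below is an assumption-laden toy: the function names, the shingle size of 3, and the threshold of 0.5 are illustrative choices, and the pairwise comparison is quadratic in the number of documents, so large pipelines typically rely on hashing schemes instead.

```python
import re


def shingles(text, n=3):
    """Set of overlapping word n-grams of a document."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(documents, threshold=0.5, n=3):
    """Return index pairs of documents whose shingle sets overlap strongly."""
    sets = [shingles(d, n) for d in documents]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical documents for illustration only.
docs = [
    "The quick brown fox jumps over the lazy dog near the river bank",
    "The quick brown fox jumps over the lazy dog near the river shore",
    "Large language models are trained on web text",
]
duplicate_pairs = near_duplicates(docs)
```

The same n-gram overlap idea can be applied between a training corpus and a benchmark test set as a rough check for the contamination described in item (ii).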

5.3.2 Tokenizer-Reliance
Tokenization is the process of breaking a sequence of words or characters into smaller units
called tokens, such that they can be fed into the model. One common tokenization approach
is subword tokenization, where we split words into smaller units, called subwords or
WordPieces [490]. The goal is to handle rare and out-of-vocabulary words in a model's
vocabulary effectively while maintaining a limited number of tokens per sequence in the
interest of computational complexity. Subword tokenizers are usually trained unsupervised to
build a vocabulary and optionally merge rules to encode the training data efficiently.
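
The following sketch shows how a subword vocabulary with merge rules can be learned without supervision, in the style of byte-pair encoding (BPE). BPE is used here as one representative subword method and is an assumption of this example; the tiny corpus and function names are likewise illustrative.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Apply the learned merge rules, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical training corpus for illustration only.
corpus = ["low", "low", "lower", "lowest", "newer", "wider"]
merges = learn_bpe_merges(corpus, num_merges=3)
pieces = tokenize("slower", merges)
```

A word that never appeared in the corpus, such as "slower", is still covered because it decomposes into known subwords and single characters.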

5.3.3 High Pre-Training Costs


The vast majority of the training costs go toward the pre-training process. Training a single
LLM can require hundreds of thousands of compute hours, which in turn cost millions of
dollars.
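
A back-of-the-envelope calculation shows how compute hours translate into cost. All numbers below (cluster size, duration, hourly price) are hypothetical values chosen for illustration and do not describe any particular model.

```python
def training_cost_usd(num_gpus, days, usd_per_gpu_hour):
    """Return total GPU-hours and their cost for a training run."""
    gpu_hours = num_gpus * days * 24
    return gpu_hours, gpu_hours * usd_per_gpu_hour


# Hypothetical run: 1,024 GPUs for 30 days at 2 USD per GPU-hour.
gpu_hours, cost = training_cost_usd(num_gpus=1024, days=30, usd_per_gpu_hour=2.0)
```

Under these assumptions the run uses 737,280 GPU-hours and costs roughly 1.5 million USD, before accounting for failed runs, experiments, or storage.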

5.3.4 Fine-Tuning Overhead


A potential drawback of pre-training LLMs on massive and diverse sets of textual data is that
the resulting models might struggle to explicitly capture the distributional properties of task-
specific datasets. To address this, fine-tuning refers to adapting the pre-trained model
parameters on comparatively smaller datasets that are specific to an individual domain or
task. LLM fine-tuning is highly effective at adapting LLMs for downstream tasks.
Technically speaking, fine-tuning can be achieved by further training a model on a smaller
dataset. Depending on the model architecture, this is done by either (i) directly fine-tuning
pre-trained models using a standard language modelling objective or (ii) adding individual
learnable layers to the output representations of a pre-trained language model, which are
designed to create compatibility between the model's output representations and the output
formats of individual downstream tasks (e.g., for text classification or sequence labelling).
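
Option (ii) can be sketched as a small learnable layer trained on top of fixed output representations. In the toy below, the two-dimensional "representations" are hypothetical stand-ins for the vectors a frozen pre-trained model would produce, and the head is a plain logistic-regression classifier trained with gradient descent.

```python
import math


def predict_proba(x, weights, bias):
    """Probability of the positive class for one representation vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_head(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head; the feature extractor stays untouched."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict_proba(x, weights, bias) - y  # gradient of log loss w.r.t. the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical frozen representations of four texts and their class labels.
features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [1, 1, 0, 0]
weights, bias = train_head(features, labels)
```

Only the head's few parameters are updated here, which is far cheaper than updating every parameter of the underlying model as in option (i).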

5.3.5 High Inference Latency


Two reasons why LLMs exhibit high inference latencies are: (1) low parallelizability,
since the inference procedure proceeds one token at a time and (2) large memory footprints,
due to the model size and the transient states needed during decoding (e.g., attention key and
value tensors).
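
The memory taken by the attention key and value tensors can be estimated with simple arithmetic. The sketch below assumes standard multi-head attention, where every layer caches one key and one value vector per head and per token; the model dimensions are hypothetical.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Size of the cached keys and values during decoding, in bytes."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical model: 32 layers, 32 heads of size 128, 2,048 tokens, 16-bit values.
size = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128, seq_len=2048)
size_gib = size / 2**30
```

With these assumed dimensions a single sequence already needs 1 GiB of cache on top of the model weights, and the figure grows linearly with both context length and batch size.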
Software
Various frameworks, such as DeepSpeed and Megatron-LM, have been designed to enable the
efficient training of multi-billion to trillion parameter language models and to account for
the unique challenges arising when training such models. This is necessitated by the fact that
most LLMs do not fit into a single device's (GPU, TPU) memory, and scaling across GPUs and
compute nodes needs to account for communication and synchronization costs.

5.3.6 Limited Context Length


Addressing everyday NLP tasks often necessitates an understanding of a broader context. For
example, if the task at hand is discerning the sentiment in a passage from a novel or a
segment of an academic paper, it is not sufficient to merely analyze a few words or sentences
in isolation. The entirety of the input (or context), which might encompass the whole section
or even the complete document, must be considered. Similarly, in a meeting transcript, the
interpretation of a particular comment could pivot between sarcasm and seriousness,
depending on the prior discussion in the meeting.

5.3.7 Prompt Brittleness


A prompt is an input to the LLM. The prompt syntax (e.g., length, blanks, ordering of
examples) and semantics (e.g., wording, selection of examples, instructions) can have a
significant impact on the model's output [342]. As an analogy, if we were to think of an LLM
as a (fuzzy) database and prompts as queries [246], it becomes clear that slight changes in the
query can result in vastly different outputs. Consequently, the wording, as well as the order of
examples included in a prompt, have been found to influence the model's behaviour
significantly.
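
One way to probe this sensitivity is to generate every ordering of the same few-shot examples and compare the model's answers across them. The sketch below only builds the prompt variants; the template, the examples, and the function name are assumptions for illustration, and the call to an actual model is left out.

```python
from itertools import permutations


def few_shot_prompts(instruction, examples, query):
    """Build one prompt per ordering of the same labelled examples."""
    prompts = []
    for ordering in permutations(examples):
        shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in ordering)
        prompts.append(f"{instruction}\n{shots}\nInput: {query}\nLabel:")
    return prompts


# Hypothetical sentiment examples for illustration only.
examples = [("great film", "positive"), ("dull plot", "negative"), ("loved it", "positive")]
prompts = few_shot_prompts("Classify the sentiment.", examples, "not my favourite")
```

Three examples already yield six prompts with identical content. If a model's predictions differ across them, the differences are due to ordering alone.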

5.3.8 Hallucinations
The popularity of services like ChatGPT suggests that LLMs are increasingly used for
everyday question-answering. As a result, the factual accuracy of these models has become
more significant.
Unfortunately, LLMs often suffer from hallucinations: generated text that contains inaccurate
information, which can be hard to detect because the text is so fluent.

5.3.9 Misaligned Behaviour


The alignment problem refers to the challenge of ensuring that the LLM's behaviour aligns
with human values, objectives, and expectations and that it does not cause unintended or
undesirable harms or consequences. Most of the existing alignment work can be categorized
into either methods for detecting misaligned behaviour (such as model evaluation and
auditing, mechanistic interpretability, or red teaming) or methods for aligning model
behaviour (such as pre-training with human feedback, instruction fine-tuning, or RLHF).

5.3.10 Out-dated Knowledge


Factual information learned during pre-training can contain inaccuracies or become outdated
with time. However, re-training the model with updated pre-training data is expensive, and
trying to "unlearn" old facts and learn new ones during fine-tuning is non-trivial. Existing
model editing techniques are limited in their effectiveness at updating isolated knowledge.

5.3.11 Brittle Evaluations


One reason why the evaluation of language models is a challenging problem is that they have
an uneven capabilities surface—a model might be able to solve a benchmark problem without
issues, but a slight modification of the problem (or even a simple change of the prompt) can
give the opposite result. Unlike humans, we cannot easily infer that an LLM that can solve
one problem will have other related capabilities. This means that it is difficult to assess the
performance of LLMs holistically since rigorous benchmarks are needed to identify
weaknesses for a wide variety of inputs.

5.3.12 Evaluations Based on Static, Human-Written Ground Truth


Another challenge of LLM evaluations is that they often rely on human-written "ground
truth" text. However, we often want to evaluate their performance in domains where such text
is scarce or relies on expert knowledge, such as programming or mathematics tasks. As
models get more capable and perform better than humans on benchmark tests in some
domains, the ability to obtain comparisons to "human-level" performance diminishes.

5.3.13 Indistinguishability between Generated and Human-Written Text

Detecting language generated by LLMs is important for various reasons, including
preventing (1) the spread of misinformation (e.g., authoritative-sounding false
narratives citing fake studies), (2) plagiarism (e.g., LLMs prompted to rewrite existing
content in ways that bypass plagiarism detection tools), (3) impersonation or identity theft
(e.g., by mimicking a person's writing style), (4) automated scams and frauds (e.g.,
large-scale generation of phishing emails), and (5) the accidental inclusion of inferior
generated text in future models' training data. However, such detection becomes harder as
the fluency of LLMs improves.

5.3.14 Tasks Not Solvable By Scale

The ongoing advancements in LLM capabilities consistently astonish the research
community, for instance by achieving high performance on the MMLU benchmark much
sooner than competitive human forecasters had anticipated. Similarly, within less than a
year, OpenAI released GPT-3.5 and GPT-4, and the latter significantly outperformed the
former on various tasks. This progress raises the question of whether some tasks will remain
out of reach no matter how far models are scaled.

5.3.15 Lacking Experimental Designs


Among the LLMs described in academic papers, many works do not include controlled
ablations, which is especially problematic given the large design space of these models.
This impedes scientific comprehension and advancement.

5.3.16 Lack of Reproducibility


The reproducibility of empirical results is important for verifying scientific claims and
ruling out errors in the experimental protocols that produced them. When researchers try to
build upon non-reproducible results, they might waste resources.

5.4 Real-world applications of large language models
Large Language Models (LLMs) like GPT-4 are increasingly being integrated into a wide
variety of real-world applications across industries. Here are some key areas where they're
making an impact:

1. Customer Support & Chatbots


 Use Case: Automating responses in customer service.
 Examples: Chatbots on e-commerce sites, banks, and telecom services.
 Benefit: Reduces wait times, operates 24/7, and scales easily.

2. Content Generation
 Use Case: Writing marketing copy, news articles, or social media content.
 Examples: Tools like Jasper, [Link], and even ChatGPT for drafting blogs.
 Benefit: Saves time and improves productivity for content teams.

3. Code Assistance

 Use Case: Helping developers write, debug, or understand code.


 Examples: GitHub Copilot, Amazon CodeWhisperer.
 Benefit: Boosts developer efficiency and reduces coding errors.

4. Education & Tutoring

 Use Case: Personalized tutoring, answering student questions, creating quizzes.


 Examples: Khanmigo (Khan Academy), Socratic by Google.
 Benefit: Offers accessible and personalized learning support.

5. Healthcare & Medical Research

 Use Case: Summarizing medical literature, assisting with documentation, or patient
interaction.
 Examples: Nuance (by Microsoft), clinical documentation assistants.
 Benefit: Reduces administrative burden for healthcare providers.

6. Legal & Compliance

 Use Case: Contract analysis, summarization, and legal research.


 Examples: Luminance, Casetext.
 Benefit: Speeds up legal review and reduces risk of human oversight.

7. Translation & Language Services

 Use Case: Real-time translation, content localization.


 Examples: DeepL, Google Translate enhancements.
 Benefit: Facilitates cross-border communication and global commerce.
8. Search & Knowledge Management

 Use Case: Enhancing internal knowledge bases or improving search relevance.


 Examples: Semantic search in enterprise tools, Microsoft Copilot.
 Benefit: Makes information retrieval faster and more intuitive.

9. Personalized Recommendations

 Use Case: Tailoring content or product suggestions using natural language
understanding.
 Examples: Netflix, Amazon, Spotify leveraging LLMs to enhance UX.
 Benefit: Increases user engagement and satisfaction.

10. Accessibility

 Use Case: Assisting users with disabilities (e.g., screen readers, text simplification).
 Examples: Tools that convert complex text to simpler language or speech.
 Benefit: Improves digital accessibility and inclusivity.

5.5 Emerging trends and future directions in Generative AI

Generative modeling is an artificial intelligence (AI) technique that generates synthetic
artifacts by analyzing training examples, learning their patterns and distribution, and then
creating realistic facsimiles. Generative AI (GAI) uses generative modeling and advances in
deep learning (DL) to produce diverse content at scale by utilizing existing media such as
text, graphics, audio, and video [1, 2]. While mainly used in research settings, GAI is entering
various domains and everyday scenarios.

5.5.1 GAI TECHNIQUES

Although there are many forms of GAI, we will look at four of the most common techniques
being leveraged today:
1. Generative adversarial networks: Generative adversarial networks (GANs) are among the
most prevalent GAI techniques in use [3]. A GAN uses a pair of neural networks. One,
known as the generator, synthesizes the content (for example, an image of a human face). The
second, known as the discriminator, evaluates the authenticity of the generator's content
(that is, whether the face is natural or fake). The networks repeat this generate/discriminate
cycle until the generator produces content that the discriminator cannot distinguish from
real content.
2. Generative Pre-trained Transformer: Generative Pre-trained Transformer (GPT)
models generate text in different languages and can create human-sounding words, sentences,
and paragraphs on almost any topic and in almost any writing style, from convincing news
articles and essays to conversations in customer-service chatbots or characters in video
games [4]. These models have matured over several generations, each with a larger parameter
set trained on a more extensive online textual corpus than the previous one. A well-known
example is OpenAI's GPT-2, which stunned the AI world by writing, from only a short
human-written prompt, a convincing article about scientists discovering a herd of unicorns in
the Andes; its successor, GPT-3, scaled the same approach much further.
3. The generative diffusion model: The generative diffusion model (GDM) synthesizes
content by taking a training data distribution, gradually adding noise, and learning how to
recover the data as a reversal of the noise addition process [6]. This way, data are generated
from randomly sampled noise through the learned denoising process.
4. Geometric DL: Geometric DL (GDL) attempts to understand, interpret, and describe AI
models in terms of geometric principles. These principles have already been extensively
studied over domains such as grids, transformations in homogeneous spaces, graphs, and
vector bundles.
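
The noise-adding half of the diffusion model in item 3 can be sketched in a few lines. The closed-form noising step and the linear schedule below follow the common DDPM-style formulation, which is an assumption of this example rather than something specified in the text; the learned denoising network that reverses the process is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal kept up to step t."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise in one shot."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]  # hypothetical data point
slightly_noisy = add_noise(x0, 10, alpha_bars, rng)
almost_pure_noise = add_noise(x0, 999, alpha_bars, rng)
```

At early steps the sample stays close to the data, while at the final step almost no signal remains, which is why generation can start from randomly sampled noise.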

5.5.2 USING GAI


These and other GAI techniques are being used in a host of applications, including the
following.
1. Natural language and music: GPTs can be readily applied to natural language (NL) text
generation. GPT-3 [4], mentioned previously, has been successfully scaled to 175 billion
learnable parameters and trained on global-scale corpora of textual exemplars. Aside from
showing high performance on a variety of NL processing (NLP) tasks, such as translation and
question answering, it is also a competent text generator producing eerily human-like textual
content. Language Model for Dialogue Applications (LaMDA) is another example. This
generative, textual conversational agent mimics human conversations [12], but unlike GPT
models trained on text corpora, it is trained on dialog corpora. Objective-Reinforced GAN
(ORGAN) is another example that produces time series artifacts in sequential media such as
music.
2. Computer graphics: AlphaFold is a neural network that creates highly accurate 3D protein
structures [14] by modeling and predicting protein structures as a graph inference problem in
3D space where nearby residues define the edges of the graph. The pair representation is
encoded as a directed edge in a graph (that is, the connection between the residues). The
NVIDIA Canvas application GauGAN transforms a textual phrase like "ocean waves hitting
rocks on the beach" into virtual landscape images in real time. When adding adjectives like
"sunset at a rocky beach" or swapping "sunset" for "afternoon" or "rainy day," the model
modifies the picture. Similarly, DALL·E is a version of GPT-3 trained to produce images
from text descriptions for concepts expressed in NL, using text/image pairs as training data.
3. Computer vision: Using semantic label maps as an input, conditional GANs (CGANs) can
produce images of high-fidelity urban scenes containing objects. Changing labels modifies
scenes concerning individual objects, such as replacing trees with buildings or changing
colors. TediGAN (Text-Guided Diverse Face Image Generation and Manipulation)
creates human portrait drawings from facial photos with random changes to facial
attributes. SinGAN [22] is a single-image generative model that synthesizes realistic textures
of arbitrary size and aspect ratio with significant variability.

5.5.3 Future GAI solutions must meet the following expectations to gain user trust and
adoption.
i) Efficiency: Training and deploying GAI models incurs high computation costs and leaves a
significant carbon footprint. For example, GDMs are comparatively slow at sampling. Using a
fine-tuned, downsized model for the input data and parameter space is a cost-efficient
approach for researchers and practitioners.
ii) Explainability: Explaining the mechanisms of a deep neural network involves analyzing
the input data properties (also known as features) to determine which affect the outcomes and
infer what happens inside the black box. However, determining which neurons affected the
synthesis of which output objects remains problematic today. Moreover, in the case of GANs,
quantifying the mutual behavior of a pair of networks is currently an intractable problem.
iii) Fairness: Although generative language models such as GPTs offer marked improvements
over various NLP tasks, they require massive amounts of unfiltered online text.
Consequently, they can generate synthetic language with bias, stereotypes, and harmful
content. To manage these risks, GAI providers should offer tools for preprocessing and
curating training data; monitoring and moderating the media generation processes; and
developing guidelines for responsible deployment models.
iv) Ethics: GAI models can immediately synthesize artifacts at scale for many different
contexts, from education to medical decision making. However, before diving into production
deployment, model creators should clearly define their goals; identify beneficiaries; and
confirm usage scenarios with target users to prevent unintended unethical product behavior.
This requires that all affected stakeholders—GAI scientists, AI engineers, domain experts,
regulatory authorities, and target users—are identified and actively participate.

v) Accountability: Prospective users must weigh GAI products' benefits against their risks.
Organizations that are creating, training, and deploying GAI systems must diligently strive to
reduce model behavior risks. This requires teams to be thorough, transparent, and proactive in
communicating identified threats, blind spots, and areas where risks are unknown when
highlighting GAI system benefits.

5.5.4 GAI WORKFLOW

To address these challenges, we envision a future GAI deployment workflow as in Figure.

The workflow can provide benefits beyond GAI models. Because synthetic data creation
increases the data set size for a particular GAI model, it can become part of the model's
iterative and incremental development and deployment cycle to continually improve
performance. Specifically, if a GAI model works well overall but performs poorly on certain
features, more data can be generated for those critical categories to help detect and correct
errors.

Moreover, traditional ML workflows take good test results on the data sets used as an
indicator that the model captures the input data distribution. However, physical- and
digital-world data sets collected over several model generations will change their class
structure and internal connections. Newly synthesized objects or artifacts, such as a
language, an environment, or a (human) being, are examples of such data sets. When observing
and evaluating such changes in deployment, the straightforward way to expand generative
capacity is to retrain the model with data that reflects the modified class structure.

GAI has become a key technology for synthesizing new virtual artifacts and for enhancing
semisynthesized augmented artifacts. GAI's achievements in many fields are paving the way
for synthetic sciences that combine AI with basic disciplines such as engineering, biology,
medicine, and environmental science. Although current advancements within a discipline are
seldom connected with developments in others due to various sociotechnical factors like
communication norms, cultural differences, AI models, data, and procedures, the metaverse is
a promising global, interdisciplinary testbed for resolving these obstacles. GAI would
diminish the distinction between real and virtual artifacts to meld human experience and
behavior across the virtual and physical worlds.
