GenAI with LLMs (1) Fundamentals
This post covers fundamentals from the Generative AI With LLMs course offered by DeepLearning.AI.
Transformer variants

Encoder only models
- Also known as Autoencoding models.
- Input sequence and output sequence are the same length.
- Use cases
  - Sentiment analysis
  - Named entity recognition
  - Word classification
- Examples
  - BERT, RoBERTa
Encoder Decoder models
- Also known as Seq2Seq models.
- Input sequence and output sequence can be different lengths.
- Use cases
  - Translation
  - Text summarization
  - Question answering
- Examples
  - T5, BART
Decoder only models
- Also known as Autoregressive models.
- Widely used today; large decoder-only models can generalize to most tasks.
- Use cases
  - Text generation
- Examples
  - GPT, BLOOM, LLaMA
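
To make the split concrete, here is a minimal sketch using the Hugging Face transformers library (my own illustration, not course code; the model choices are illustrative): an encoder-only model behind a classification pipeline versus a decoder-only model generating text.

```python
# Minimal sketch; assumes the Hugging Face transformers package is installed.
from transformers import pipeline

# Encoder-only (autoencoding) models suit understanding tasks such as sentiment analysis.
classifier = pipeline("sentiment-analysis")  # default checkpoint is a small encoder-only model
print(classifier("The course material was clear and well paced."))

# Decoder-only (autoregressive) models generate text one token at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("In-context learning means", max_new_tokens=20))
```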
Prompt engineering
In-context learning
In-context learning: including examples of the task that you want the model to carry out inside the prompt (i.e., inside the context window) can get the model to produce better outcomes.
Zero-shot, one-shot, few-shot
Zero-shot inference: including zero examples.
One-shot inference: including a single example.
Few-shot inference: including several examples.
Note:
- Large models can often do well with zero-shot inference, while smaller models struggle with it and benefit from seeing one or more examples of the desired behavior.
- The context window limits how much in-context learning you can pass to the model. If the model still isn't performing well with 5 or 6 examples included, you should try fine-tuning it instead.
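
As a toy illustration of how a few-shot prompt is assembled (my own sketch, not course code; the reviews and labels are made up):

```python
# Build a few-shot prompt by concatenating labeled demonstrations before the new input.
examples = [
    ("I loved this movie, what a ride!", "positive"),
    ("The plot was dull and the acting was worse.", "negative"),
]

def build_few_shot_prompt(examples, query):
    """Return a prompt containing in-context examples followed by the query to classify."""
    lines = [f"Review: {text}\nSentiment: {label}\n" for text, label in examples]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

print(build_few_shot_prompt(examples, "A surprisingly touching story."))
```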
Generative configuration - inference parameters

Max new tokens
- Used to limit the number of tokens that the model will generate. You can think of this as putting a cap on the number of times the model will go through the selection process (i.e., from computing the probability distribution to sampling or selecting the next token).
- Remember it's max new tokens, not a hard number of new tokens generated: the model may stop earlier, for example when it predicts an end-of-sequence token.
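
A short sketch of how this parameter is typically passed at inference time, assuming the Hugging Face transformers generate() API (the model choice is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The generative AI project lifecycle starts with", return_tensors="pt")
# max_new_tokens caps the number of selection steps; generation may still stop
# earlier if the model emits an end-of-sequence token.
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```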
Top-k and Top-p
Recall that the output from the transformer’s softmax layer is a probability distribution across the entire dictionary of words that the model uses.

Greedy decoding
- Model chooses the word/token with the highest probability.
- This method can work very well for short generation but is susceptible to repeated words or repeated sequences of words. If you want to generate text that's more natural and more creative, and to avoid repeating words, you need to use some other controls.
Random (weighted) sampling
- The model chooses the output word at random using the probability distribution to weight the selection.
- Reduces the likelihood that words will be repeated. However, depending on the settings, the output may be too creative, producing words that cause the generation to wander off into topics or words that just don't make sense.
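
A toy numpy sketch (my own, not from the course) contrasting the two strategies on a made-up next-token distribution:

```python
import numpy as np

vocab = ["cake", "donut", "banana", "apple", "bread"]
probs = np.array([0.20, 0.10, 0.02, 0.65, 0.03])  # hypothetical softmax output

# Greedy decoding: always pick the highest-probability token.
greedy_token = vocab[int(np.argmax(probs))]

# Random (weighted) sampling: draw a token using the probabilities as weights.
rng = np.random.default_rng(seed=0)
sampled_token = vocab[rng.choice(len(vocab), p=probs)]

print(greedy_token, sampled_token)
```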
Top-k and top-p are sampling methods we can use to limit the random sampling and increase the chance that the output will be sensible.
Top-k

- The model chooses from only the \(k\) tokens with the highest probability, selecting among them using the probability weighting.
- This method can help the model have some randomness while preventing the selection of highly improbable completion words. This in turn makes your text generation more likely to sound reasonable and to make sense.
Top-p

- Limits the random sampling to the top predictions whose combined (cumulative) probability does not exceed \(p\); the model then samples from this restricted set using the probability weighting.
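
A toy numpy sketch (my own illustration) of both filters applied to a made-up distribution before sampling; renormalizing the surviving probabilities is one common convention and an assumption here:

```python
import numpy as np

probs = np.array([0.05, 0.45, 0.02, 0.30, 0.18])  # hypothetical softmax output

def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens, then renormalize."""
    keep = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the highest-probability tokens whose cumulative probability stays within p."""
    order = np.argsort(probs)[::-1]           # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    keep = order[cumulative <= p]
    if keep.size == 0:                         # always keep at least the top token
        keep = order[:1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

print(top_k_filter(probs, k=3))
print(top_p_filter(probs, p=0.8))
```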
Temperature

- This parameter influences the shape of the probability distribution that the model calculates for the next token.
- The temperature is a scaling factor that’s applied within the final softmax layer of the model that impacts the shape of the probability distribution of the next token. In contrast to top-k and top-p, changing temperature actually alters the predictions the model will make.
- Cooler temperature (\(<1\)): strongly peaked distribution; the output is less random.
- Warmer temperature (\(>1\)): broader, flatter distribution; the output is more random and more variable.
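
A small numpy sketch (my own illustration, assuming the common convention of dividing the logits by the temperature before the softmax):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])  # made-up next-token logits

print(softmax_with_temperature(logits, temperature=0.5))  # cooler: more peaked
print(softmax_with_temperature(logits, temperature=1.0))  # standard softmax
print(softmax_with_temperature(logits, temperature=2.0))  # warmer: flatter
```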
GenAI lifecycle

References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.
- The Transformer architecture, with the core “self-attention” mechanism.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report.
- GPT
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
- RoBERTa
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv preprint arXiv:1910.10683.
- T5
- Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., … & Zettlemoyer, L. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv preprint arXiv:1910.13461.
- BART
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … & Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
- GPT-3
- BigScience Workshop. (2022). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint arXiv:2211.05100.
- BLOOM
- AI21 Labs. (2021). Jurassic-1: Technical Details and Evaluation. AI21 Labs Blog.
- Jurassic
- Touvron, H., Lavril, T., Izacard, G., Martinet, X., … & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
- LLaMA