AI Glossary


49 terms from the world of artificial intelligence and neural networks.

Categories: Fundamentals (6) · Models (11) · Training (4) · NLP (11) · Generation (6) · Infrastructure (9) · Safety (2)

Fundamentals

Models

Transformer

A neural network architecture introduced by Google researchers in the 2017 paper "Attention Is All You Need". Uses an attention mechanism to process sequences. Powers GPT, BERT, Claude, Gemini, and most modern language models.

Large Language Model (LLM)

A neural network with billions of parameters trained on massive text corpora. Capable of generating, analyzing, and transforming text. Examples: GPT-4, Claude, Gemini, Llama, DeepSeek.

Context Window

The maximum number of tokens a model can process in a single request (including input and output). GPT-4o has 128K tokens, Claude 3.5 has 200K, and Gemini 1.5 Pro supports up to 1M.

Attention Mechanism

A key component of transformers that allows the model to focus on relevant parts of the input. Self-attention lets each token 'look at' all other tokens in the sequence.
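
As a rough illustration, the single-head case can be sketched in plain Python (toy 2-dimensional vectors, no learned projection matrices):

```python
# Minimal sketch of scaled dot-product attention, the core transformer
# operation, using plain Python lists instead of tensors.
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, mix the values weighted by query-key similarity."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Dot product with every key, scaled by sqrt(d) to keep softmax stable.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Self-attention: queries, keys, and values all come from the same sequence.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx = attention(seq, seq, seq)
```

Each output row is a convex combination of the value vectors, with weights summing to 1.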

Multi-Head Attention

An extension of the attention mechanism where multiple attention 'heads' operate in parallel, each focusing on different aspects of the input, improving representation quality.

Multimodal Model

An AI model capable of working with multiple data types: text, images, audio, video. Examples: GPT-4o (text + images), Gemini (text + images + video + audio).

AI Agent

An autonomous LLM-based system capable of planning actions, using tools (search, code, APIs), and iteratively solving tasks without constant human oversight.

Tool Use (Function Calling)

An LLM's ability to call external functions and APIs: web search, code execution, database queries. A key capability for AI agents.
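
The loop behind tool use can be sketched as follows; the tool registry and the stubbed model here are entirely made up, and real provider APIs (OpenAI, Anthropic) differ in message format and detail:

```python
# Illustrative sketch of the tool-use loop: the "model" asks for a tool call,
# the runtime executes it, and the result is fed back for a final answer.
import json

# Hypothetical tool registry (names and signatures are invented).
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def fake_model(messages):
    """Stand-in for an LLM: first requests a tool call, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_weather",
                              "arguments": {"city": "Paris"}}}
    return {"content": "It is 21 °C in Paris."}

def run_agent(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        reply = fake_model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]
        # Execute the requested tool and feed the result back to the model.
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})

answer = run_agent("What's the weather in Paris?")
```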

Mixture of Experts (MoE)

An architecture where the model consists of multiple 'experts', but only a subset is activated for each request. Scales parameters without proportional compute increase. Used in Mixtral and DeepSeek V3.
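
The routing idea can be shown with a toy sketch (the "experts" here are trivial functions standing in for feed-forward sub-networks):

```python
# Toy sketch of MoE routing: a gate scores the experts and only the top-k
# highest-scoring ones are evaluated; the rest cost nothing for this input.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical "experts": each is just a scalar function in this sketch.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

def moe(x, gate_scores, k=2):
    # Pick the k best experts; unselected experts are never evaluated.
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])
    return sum(w * experts[i](x) for w, i in zip(weights, top))

y = moe(3.0, gate_scores=[0.1, 2.0, -1.0, 1.5], k=2)
```

With k=2, only experts 1 and 2 run here; the total parameter count grows with the number of experts while per-request compute stays roughly fixed.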

Reasoning

An AI model's ability to reason logically, solve problems, and draw conclusions. Models with explicit reasoning (o1, o3, DeepSeek R1) generate an internal chain of thought before answering.

Vision Model

An AI model trained to analyze images: classification, object detection, segmentation, captioning. Modern LLMs (GPT-4o, Claude, Gemini) include vision capabilities.

Training

NLP

Token

The smallest unit of text that a language model processes. One token is roughly 4 characters in English or 1–2 characters in Russian. API pricing is typically calculated per token.

Tokenizer

An algorithm that splits text into tokens before feeding it to a language model. Different models use different tokenizers (BPE, SentencePiece, tiktoken).

Prompt Engineering

The practice of crafting text instructions (prompts) for AI models to get the best results. Includes techniques: few-shot, chain-of-thought, system prompts.

Few-Shot Learning

An approach where a model is given a few examples (shots) in the prompt so it understands the expected format and style. Does not require fine-tuning.
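
A few-shot prompt for sentiment classification might be assembled like this (the examples and format are illustrative, not any provider's required layout):

```python
# Sketch of a few-shot prompt: two labeled examples teach the model the
# expected input -> output format before the real query is appended.
examples = [
    ("The movie was fantastic!", "positive"),
    ("Terrible service, never again.", "negative"),
]

def build_few_shot_prompt(query):
    lines = ["Classify the sentiment of each review as positive or negative.",
             ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The prompt ends mid-pattern so the model completes the last label.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt("I loved every minute of it.")
```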

Zero-Shot Learning

A model's ability to perform a task without examples — from text description alone. Larger models typically have better zero-shot capabilities.

Chain-of-Thought (CoT)

A prompting technique where the model reasons step by step before giving a final answer. Improves accuracy on logic, math, and multi-step analysis tasks.

RAG (Retrieval-Augmented Generation)

An architecture pattern where the model first retrieves relevant information from a knowledge base, then generates an answer based on the found data. Reduces hallucinations and enables working with up-to-date information.
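
A minimal sketch of the pattern, with naive word-overlap retrieval standing in for embedding search over a vector database:

```python
# Minimal RAG sketch: retrieve the most relevant document, then build a
# prompt that grounds the model's answer in the retrieved context.
docs = [
    "Transformers were introduced by Google in 2017.",
    "GGUF is a file format for quantized models.",
    "Tokens are the smallest units a language model processes.",
]

def retrieve(query, k=1):
    # Toy relevance score: number of shared lowercase words.
    q = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_rag_prompt("What is the GGUF file format?")
```

In production the retrieval step uses embeddings and a vector database rather than word overlap.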

Embedding

A numerical representation of text (or another object) as a fixed-length vector. Used for semantic search, clustering, and recommendations. Models: text-embedding-ada-002, Cohere Embed.
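
Embeddings are typically compared with cosine similarity; the 3-dimensional vectors below are made up for illustration (real embeddings have hundreds or thousands of dimensions):

```python
# Cosine similarity between embedding vectors: the standard way to measure
# semantic closeness in embedding space.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

cat = [0.9, 0.1, 0.3]    # hypothetical embedding of "cat"
dog = [0.8, 0.2, 0.35]   # hypothetical embedding of "dog"
car = [0.1, 0.9, 0.2]    # hypothetical embedding of "car"

# Semantically closer texts should score higher.
closer = cosine(cat, dog) > cosine(cat, car)
```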

BPE (Byte Pair Encoding)

A tokenization algorithm that iteratively merges the most frequent character pairs. Used in GPT, Claude, and most modern LLMs.
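
The core merge step can be sketched in a few lines (training a real BPE vocabulary works over a whole corpus, not a single word):

```python
# The core BPE loop: count adjacent symbol pairs and merge the most frequent
# pair into one token, repeating until the merge budget is spent.
from collections import Counter

def bpe(word, num_merges):
    tokens = list(word)  # start from individual characters
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

toks = bpe("aaabdaaabac", 2)
```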

Temperature

A generation parameter controlling the randomness of model responses. Low temperature (0.0–0.3) yields deterministic answers, high (0.7–1.0) produces more creative and varied outputs.
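
Mechanically, the logits are divided by the temperature before the softmax, which is easy to see in a sketch:

```python
# How temperature reshapes the next-token distribution: dividing logits by T
# before the softmax sharpens it (low T) or flattens it (high T).
import math

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.2)  # near-deterministic
hot = softmax_with_temperature(logits, 1.5)   # flatter, more varied
```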

Top-p (Nucleus Sampling)

A sampling method where the model picks the next token from the smallest set of tokens whose cumulative probability is at least p. An alternative to top-k sampling.
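
A sketch of the filtering step over a toy next-token distribution:

```python
# Nucleus (top-p) sampling sketch: keep the smallest set of tokens whose
# cumulative probability reaches p, renormalize, and sample from that set.
import random

def top_p_filter(probs, p):
    """probs: {token: probability}. Returns the renormalized nucleus."""
    nucleus, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1],
                              reverse=True):
        nucleus[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(nucleus.values())
    return {t: pr / total for t, pr in nucleus.items()}

def sample(probs, p, rng=random):
    nucleus = top_p_filter(probs, p)
    tokens, weights = zip(*nucleus.items())
    return rng.choices(tokens, weights=weights)[0]

dist = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
nucleus = top_p_filter(dist, 0.9)  # drops the low-probability tail
```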

Generation

Infrastructure

Vector Database

A database optimized for storing and searching embeddings. Enables fast semantic similarity search. Examples: Pinecone, Weaviate, Qdrant, ChromaDB.

Quantization

A method of reducing model size by lowering weight precision (from FP16 to INT8 or INT4). Enables running large models on consumer GPUs with minimal quality loss.
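
The idea in miniature, as a symmetric INT8 round-trip over a handful of weights (real quantizers work per-tensor or per-channel and are considerably more sophisticated):

```python
# Toy symmetric INT8 quantization round-trip: scale floats into [-127, 127]
# integers and back, trading a small error for much less memory than FP16.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 0.07, -0.91]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)  # close to the originals, within scale / 2
```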

GGUF

A file format for quantized models used by llama.cpp and other tools for running LLMs locally. Supports various quantization levels (Q4, Q5, Q8).

VRAM (Video RAM)

Video memory of a graphics processor. Determines the maximum model size that can be loaded for inference. Llama 70B in FP16 requires ~140 GB VRAM.
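
The ~140 GB figure follows from a back-of-the-envelope calculation, sketched below; note this covers the weights only, and real inference also needs memory for the KV cache and activations:

```python
# Rough VRAM estimate for model weights: parameter count times bytes per
# parameter (FP32 = 4, FP16 = 2, INT8 = 1, INT4 = 0.5).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(n_params, dtype):
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

fp16 = weight_vram_gb(70e9, "fp16")  # Llama 70B in FP16
int4 = weight_vram_gb(70e9, "int4")  # same model, 4-bit quantized
```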

Inference

The process of getting a response from a trained model — feeding input and generating output. Unlike training, inference requires significantly less compute.

Latency

Model response time — from sending a request to receiving the first token (TTFT) or the full response. A key metric for production systems.

Throughput

The number of tokens per second a model can generate. Depends on model size, quantization, GPU, and the number of concurrent requests.

API (Application Programming Interface)

A programming interface for interacting with AI models. Allows sending requests and receiving responses via HTTP. Major providers: OpenAI, Anthropic, Google, DeepSeek.

llama.cpp

A highly optimized engine for running LLMs locally on CPU and GPU. Supports GGUF format and various quantization levels. One of the most popular tools for running models on consumer hardware.

Safety