AI Glossary

49 terms from the world of artificial intelligence and neural networks.

Fundamentals (6)Models (11)Training (4)NLP (11)Generation (6)Infrastructure (9)Safety (2)

Fundamentals

A branch of computer science focused on building systems capable of performing tasks that typically require human intelligence: speech recognition, decision-making, language translation, and content generation.

Machine Learning (ML)

A subset of AI where algorithms learn patterns from data without being explicitly programmed. Includes supervised, unsupervised, and reinforcement learning.

Deep Learning

A subset of machine learning that uses neural networks with many layers (deep networks). Powers modern breakthroughs in language processing, computer vision, and content generation.

Neural Network

A mathematical model inspired by biological neurons. Consists of layers of nodes (neurons), each performing simple computations. Together, layers can approximate complex functions.

Open Source AI

AI models with open weights and/or code that can be downloaded and run locally. Examples: Llama, Mistral, Qwen, DeepSeek, Gemma. Opposite of proprietary models (GPT-4, Claude).

Benchmark

A standardized test for evaluating AI model quality. Popular benchmarks: MMLU (knowledge), HumanEval (code), MT-Bench (chat), LMSYS Chatbot Arena (user ELO rating).

Models

Transformer

A neural network architecture proposed by Google in 2017. Uses an attention mechanism to process sequences. Powers GPT, BERT, Claude, Gemini, and most modern language models.

Large Language Model (LLM)

A neural network with billions of parameters trained on massive text corpora. Capable of generating, analyzing, and transforming text. Examples: GPT-4, Claude, Gemini, Llama, DeepSeek.

Context Window

The maximum number of tokens a model can process in a single request (including input and output). GPT-4o has 128K, Claude 3.5 has 200K, Gemini has up to 1M.

Attention Mechanism

A key component of transformers that allows the model to focus on relevant parts of the input. Self-attention lets each token 'look at' all other tokens in the sequence.

Multi-Head Attention

An extension of the attention mechanism where multiple attention 'heads' operate in parallel, each focusing on different aspects of the input, improving representation quality.

Multimodal Model

An AI model capable of working with multiple data types: text, images, audio, video. Examples: GPT-4o (text + images), Gemini (text + images + video + audio).

AI Agent

An autonomous LLM-based system capable of planning actions, using tools (search, code, APIs), and iteratively solving tasks without constant human oversight.

Tool Use (Function Calling)

An LLM's ability to call external functions and APIs: web search, code execution, database queries. A key capability for AI agents.

Mixture of Experts (MoE)

An architecture where the model consists of multiple 'experts', but only a subset is activated for each request. Scales parameters without proportional compute increase. Used in Mixtral and DeepSeek V3.

Reasoning

An AI model's ability to reason logically, solve problems, and draw conclusions. Models with explicit reasoning: o1, o3, DeepSeek R1. Use chain-of-thought internally.

Vision Model

An AI model trained to analyze images: classification, object detection, segmentation, captioning. Modern LLMs (GPT-4o, Claude, Gemini) include vision capabilities.

Training

Fine-Tuning

The process of further training a pre-trained model on a specialized dataset to adapt it for a specific task. Requires less data and compute than training from scratch.

LoRA (Low-Rank Adaptation)

An efficient fine-tuning method that updates only low-rank adapters instead of all model weights. Significantly reduces memory and compute requirements.

RLHF (Reinforcement Learning from Human Feedback)

A training method where the model improves based on human evaluations. Used for alignment — making models helpful, safe, and instruction-following.

Transfer Learning

An approach where a model trained on one task is adapted to solve another. The basis of fine-tuning: a base model is trained on general data, then refined on specialized data.

NLP

Token

The smallest unit of text that a language model processes. One token is roughly 4 characters in English or 1–2 characters in Russian. API pricing is typically calculated per token.

Tokenizer

An algorithm that splits text into tokens before feeding it to a language model. Different models use different tokenizers (BPE, SentencePiece, tiktoken).

Prompt Engineering

The practice of crafting text instructions (prompts) for AI models to get the best results. Includes techniques: few-shot, chain-of-thought, system prompts.

Few-Shot Learning

An approach where a model is given a few examples (shots) in the prompt so it understands the expected format and style. Does not require fine-tuning.

Zero-Shot Learning

A model's ability to perform a task without examples — from text description alone. Larger models typically have better zero-shot capabilities.

Chain-of-Thought (CoT)

A prompting technique where the model reasons step by step before giving a final answer. Improves accuracy on logic, math, and multi-step analysis tasks.

RAG (Retrieval-Augmented Generation)

An architecture pattern where the model first retrieves relevant information from a knowledge base, then generates an answer based on the found data. Reduces hallucinations and enables working with up-to-date information.

Embedding

A numerical representation of text (or another object) as a fixed-length vector. Used for semantic search, clustering, and recommendations. Models: text-embedding-ada-002, Cohere Embed.

BPE (Byte Pair Encoding)

A tokenization algorithm that iteratively merges the most frequent character pairs. Used in GPT, Claude, and most modern LLMs.

Temperature

A generation parameter controlling the randomness of model responses. Low temperature (0.0–0.3) yields deterministic answers, high (0.7–1.0) produces more creative and varied outputs.

Top-p (Nucleus Sampling)

A sampling method where the model picks the next token from the smallest set of tokens whose cumulative probability is >= p. An alternative to top-k sampling.

Generation

Diffusion Model

A type of generative model that learns to create data (typically images) by gradually removing noise. Used in Stable Diffusion, DALL-E 3, Midjourney, FLUX.

Stable Diffusion

An open-source diffusion model for generating images from text. Can run locally on consumer GPUs. Versions: SD 1.5, SDXL, SD 3.

Text-to-Image

The task of generating an image from a text description. Major services: Midjourney, DALL-E 3, Stable Diffusion, FLUX, Ideogram.

GAN (Generative Adversarial Network)

An architecture of two neural networks: a generator creates data, a discriminator evaluates its realism. Used for image generation before diffusion models became dominant.

Text-to-Speech (TTS)

Technology for converting text to natural speech. Modern TTS systems use neural networks to generate realistic voices. Examples: OpenAI TTS, ElevenLabs, Bark.

Speech-to-Text (STT)

Technology for recognizing speech and converting it to text. Examples: OpenAI Whisper, Google Speech-to-Text, Deepgram. Whisper is an open model with high quality.

Infrastructure

Vector Database

A database optimized for storing and searching embeddings. Enables fast semantic similarity search. Examples: Pinecone, Weaviate, Qdrant, ChromaDB.

Quantization

A method of reducing model size by lowering weight precision (from FP16 to INT8 or INT4). Enables running large models on consumer GPUs with minimal quality loss.

GGUF

A file format for quantized models used by llama.cpp and other tools for running LLMs locally. Supports various quantization levels (Q4, Q5, Q8).

VRAM (Video RAM)

Video memory of a graphics processor. Determines the maximum model size that can be loaded for inference. Llama 70B in FP16 requires ~140 GB VRAM.

Inference

The process of getting a response from a trained model — feeding input and generating output. Unlike training, inference requires significantly less compute.

Latency

Model response time — from sending a request to receiving the first token (TTFT) or the full response. A key metric for production systems.

Throughput

The number of tokens per second a model can generate. Depends on model size, quantization, GPU, and the number of concurrent requests.

API (Application Programming Interface)

A programming interface for interacting with AI models. Allows sending requests and receiving responses via HTTP. Major providers: OpenAI, Anthropic, Google, DeepSeek.

llama.cpp

A highly optimized engine for running LLMs locally on CPU and GPU. Supports GGUF format and various quantization levels. One of the most popular tools for running models on consumer hardware.

Safety

Hallucination

When an AI model generates plausible but factually incorrect output. A key challenge of LLMs, addressed through RAG, ground truth data, and verification.

Alignment

The process of adjusting an AI model so its behavior aligns with human values and expectations. Includes RLHF, constitutional AI, and other safety techniques.