From distances to data to the inner workings of AI, demystified.
Three connected topics that build your mental model of modern AI: how machines measure similarity, where your data actually lives, and what happens when an LLM encounters questions it can't answer.
Whether you're evaluating AI tools or leading a team that uses them, understanding how LLMs find, process, and fail with data is now a core professional skill. This lesson gives you that foundation.
Imagine you're a taxi driver in New York City. You can't drive diagonally through buildings; you must follow the grid of streets.
Manhattan Distance: sum of absolute differences along each axis
d = |x₁ − x₂| + |y₁ − y₂|
Also called L1 distance or taxicab distance. It measures how far apart two points are by summing the absolute differences of their coordinates.
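For example, from (1, 2) to (4, 6): d = |1 − 4| + |2 − 6| = 3 + 4 = 7.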
Distance metrics are how machines measure similarity. They're foundational to search, recommendations, and retrieval.
Manhattan (L1): Sums absolute differences. Great for high-dimensional, sparse data. Used in clustering and nearest-neighbor search.
Euclidean (L2): Straight-line "as the crow flies" distance. The classic formula: √((x₁−x₂)² + (y₁−y₂)²). Sensitive to outliers.
Cosine Similarity: Measures the angle between two vectors rather than their magnitudes. This is what most embedding-based AI search uses today.
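To make these concrete, here's a minimal Python sketch (the points are illustrative, not from the lesson) computing all three for one pair of 2D points:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Manhattan (L1): sum of absolute coordinate differences -> 3 + 4 = 7
manhattan = np.sum(np.abs(a - b))

# Euclidean (L2): straight-line distance -> sqrt(3² + 4²) = 5
euclidean = np.linalg.norm(a - b)

# Cosine similarity: the angle between the vectors, ignoring magnitude
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(manhattan, euclidean, round(cosine, 3))  # 7.0 5.0 0.992
```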
When an LLM-powered app searches for relevant documents, it converts text to vectors (embeddings) and uses distance metrics to find the closest matches. This is the heart of RAG: Retrieval Augmented Generation.
A delivery robot is at grid position (2, 3) and needs to reach (7, 6). What is the Manhattan distance?
Before AI can help, it needs to reach your data. But enterprise data is scattered everywhere.
The modern data landscape: scattered, siloed, and diverse
Let's explore where your data actually lives, and why that matters for AI.
ERP, CRM, and HR systems: the "official" source of truth. Structured, governed, but often locked behind APIs. Examples: SAP, Salesforce, Workday.
Relational (SQL Server, PostgreSQL) and NoSQL (MongoDB, DynamoDB). Structured data with schemas: the easiest for AI to consume when properly connected.
Network drives, SharePoint, local folders. Word docs, spreadsheets, PDFs: a treasure trove of unstructured data that AI often can't see.
Massive volumes of institutional knowledge trapped in inboxes. Rich context, but privacy and compliance constraints make it tricky for AI.
Scalable object storage. S3 buckets can hold anything: images, logs, backups, data lake files. But storage ≠ understanding; data needs indexing.
Your team has years of client proposal documents saved on a shared network drive. Why might an LLM struggle to use this data effectively?
So how do we connect LLMs to data they've never seen? Enter Retrieval Augmented Generation (RAG).
The evolution: from simple retrieval to knowledge-graph-powered generation
RAG retrieves relevant chunks of text from a vector database and feeds them to the LLM as context. It's the bridge between your data and the model.
Documents → chunked → embedded as vectors → stored in a vector DB. User query is also embedded, and the nearest chunks are retrieved.
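Here's a minimal sketch of that pipeline, with a toy embed() function standing in for a real embedding model (the function, documents, and query below are all illustrative, not from the lesson):

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedder (hashed bag-of-words); a real system calls an embedding model."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word.strip(".,!?")) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

# 1. Chunk documents (here, one sentence per chunk) and embed each chunk.
chunks = [
    "Q3 revenue grew 12% year over year.",
    "The onboarding guide covers VPN setup.",
    "Our refund policy allows returns within 30 days.",
]
index = np.stack([embed(c) for c in chunks])  # stands in for the vector DB

# 2. Embed the user query the same way and rank chunks by cosine similarity.
query = embed("What is the refund policy?")
scores = index @ query                        # dot product = cosine (unit vectors)
top = np.argsort(scores)[::-1][:2]

# 3. The nearest chunks become the context passed to the LLM.
context = "\n".join(chunks[i] for i in top)
print(context)
```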
What is the key advantage of GraphRAG over basic RAG?
LLMs are powerful, but they have a fundamental limitation: they only know what they were trained on.
Text broken into tokens, viewed through a limited context window
Every LLM has a training cutoff date. Ask about events after that date and the model has no data; it may hallucinate a plausible-sounding but incorrect answer.
Tokens are the smallest units an LLM processes. They're not exactly words; they can be word fragments, punctuation, or even single characters.
Rule of thumb: 1 token ≈ ¾ of a word in English.
The sentence "ChatGPT is amazing!" tokenizes roughly as: "Chat" + "GPT" + " is" + " amazing" + "!"
That's ~5 tokens. Longer or unusual words get split into more pieces.
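You can check this yourself with the tiktoken library (a sketch assuming tiktoken is installed; the exact split depends on which tokenizer the model uses):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models
tokens = enc.encode("ChatGPT is amazing!")

print(len(tokens))                        # token count (~5 for this sentence)
print([enc.decode([t]) for t in tokens])  # the individual token strings
```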
Tokens determine cost (you pay per token), speed (more tokens = slower), and what fits in the context window.
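For example, at a hypothetical rate of $3 per million input tokens, a 100K-token prompt costs about $0.30, and it also takes noticeably longer to process than a short one.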
Think of the context window as the LLM's short-term memory. It's the total number of tokens the model can "see" at once.
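As an illustration, here's a toy sketch of how an app might trim conversation history to fit a token budget (the ~4-characters-per-token estimate and all names here are illustrative; real systems count tokens with the model's tokenizer):

```python
def est_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per English token.
    return max(1, len(text) // 4)

def fit_to_window(messages: list[str], budget: int) -> list[str]:
    kept, used = [], 0
    for msg in reversed(messages):   # walk newest-first
        cost = est_tokens(msg)
        if used + cost > budget:
            break                    # older messages fall out of "memory"
        kept.append(msg)
        used += cost
    return list(reversed(kept))      # restore chronological order

history = [
    "A very long early message about project background. " * 40,
    "Short recent question?",
    "Short recent answer.",
]
print(fit_to_window(history, budget=40))  # only the recent messages survive
```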
Context windows have grown fast: from ~4K–16K tokens in earlier models to ~128K, ~200K, and now ~1M+ tokens in the largest ones.
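For scale, using the ¾-word rule of thumb above: a 128K-token window holds roughly 96,000 words, about the length of a novel.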
Even with huge context windows, the LLM still only knows what's in that window right now plus its training data. Your private company data? Not included unless you put it there (via RAG, fine-tuning, or placing it directly in the prompt).
An LLM with a 128K token context window is asked about your company's Q3 earnings report from last month. What is the most likely outcome?
Here's the full picture of how these three topics connect:
Manhattan and other distance metrics are how AI finds relevant information. Text becomes vectors; closeness means relevance.
Time to test your understanding! Answer all 5 questions, then submit for your score.