# Exercise 1 - Build a FAISS Flat Index

Goal: Build a FAISS `IndexFlatIP` from scratch, add vectors to it, and run your first real vector search.
## Background
`IndexFlatIP` is the simplest FAISS index. It stores every vector in full and performs exact, exhaustive inner-product search - no approximation. It is correct by definition and fast enough for up to ~100k vectors.

Because the index scores by raw inner product, you must L2-normalise all vectors before adding them: the inner product of two unit vectors equals their cosine similarity.
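A quick NumPy-only sketch of that identity (toy 4-dimensional vectors standing in for real 768-dimensional embeddings):

```python
import numpy as np

# Two toy vectors standing in for embeddings.
v = np.array([[3.0, 4.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 1.0]], dtype="float32")

# L2-normalise each row: divide by its Euclidean norm.
norms = np.linalg.norm(v, axis=1, keepdims=True)
unit = v / norms

# Inner product of the unit vectors...
ip = float(unit[0] @ unit[1])
# ...equals the cosine similarity of the originals.
cos = float((v[0] @ v[1]) / (np.linalg.norm(v[0]) * np.linalg.norm(v[1])))
print(np.isclose(ip, cos))  # → True
```

In practice FAISS ships a helper, `faiss.normalize_L2(x)`, that does the same normalisation in place on a float32 array.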
## Assignment

Open `01_flat_index.py`.

- Load a PDF from `data/papers/`, split it into 500-char chunks, and embed them all.
- Stack the embeddings into an `(N, 768)` float32 NumPy array. Make sure every vector is L2-normalised.
- Create a `faiss.IndexFlatIP(768)` and add all vectors with `index.add(vectors)`.
- Write a `search(query_text, index, chunks, k)` function. It should embed the query, reshape it to `(1, 768)`, call `index.search(query, k)`, and return the top-k chunks with their scores.
- Run three different queries and print the top-3 results with scores for each.
## Thinking questions

- FAISS does not store the original text, only vectors. What data structure do you need alongside the index to map a result position back to the chunk text?
- `index.search()` returns two arrays: `scores` and `indices`. What are their shapes when you search for k=3 with a single query?
- What would happen if you added un-normalised vectors to an `IndexFlatIP` index and searched with a normalised query?