Exercise 2 - Persist and Reload
Goal: Save your FAISS index to disk and reload it in a fresh process, without re-embedding the document.
Background
In Exercise 1 you built the index every time the script ran. That means re-embedding the entire document on each run, which is slow and wasteful. In a real pipeline you build the index once and then reload it from disk, which is typically much faster than rebuilding.
FAISS saves an index with faiss.write_index and loads it back with faiss.read_index. However, the index stores only vectors, not text, so you need to save the chunk list separately and keep it in the same order as the vectors.
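Because the index holds only vectors, a search returns integer row positions rather than text. The following is a minimal sketch of keeping a parallel chunk list, assuming the chunks were added to the index in list order so that row i corresponds to chunks[i]. The helper names (save_chunks, load_chunks, ids_to_chunks) are hypothetical and not part of FAISS.

```python
import json
from pathlib import Path


def save_chunks(chunks, path="chunks.json"):
    """Write the chunk texts to JSON, preserving list order."""
    Path(path).write_text(json.dumps(chunks, ensure_ascii=False), encoding="utf-8")


def load_chunks(path="chunks.json"):
    """Read the chunk texts back as a Python list."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def ids_to_chunks(ids, chunks):
    """Map row ids from one query's search result to chunk texts.

    FAISS pads the id list with -1 when it finds fewer than k neighbours,
    so those entries are skipped. Assumes row i of the index is chunks[i].
    """
    return [chunks[i] for i in ids if i != -1]
```

If the order of chunks.json ever drifts from the order in which vectors were added, the lookup silently returns the wrong text, which is why both files should be written in the same run.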
Assignment
Open 02_persist.py.
Part A - Save
- If index.faiss does not already exist, build the index from Exercise 1.
- Save the index: faiss.write_index(index, "index.faiss").
- Save the chunks list to chunks.json using the json module.
- Print the sizes of both files.
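One possible shape for the Part A steps is sketched below. It assumes the index and chunks objects come from your Exercise 1 code. build_index in the trailing comment is a hypothetical stand-in for that code, and the two helper functions are illustrative names, not a required structure.

```python
import json
import os


def save_index_and_chunks(index, chunks,
                          index_path="index.faiss",
                          chunks_path="chunks.json"):
    """Persist the FAISS index and the chunk list as two sibling files."""
    import faiss  # imported here so format_sizes below works without FAISS

    faiss.write_index(index, index_path)
    with open(chunks_path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False)


def format_sizes(paths):
    """Return one 'path: N bytes' line per file."""
    return [f"{p}: {os.path.getsize(p)} bytes" for p in paths]


# Intended flow in 02_persist.py (build_index is hypothetical):
#
#   if not os.path.exists("index.faiss"):
#       index, chunks = build_index()
#       save_index_and_chunks(index, chunks)
#   print("\n".join(format_sizes(["index.faiss", "chunks.json"])))
```

Checking only for index.faiss follows the assignment text. A stricter script might also check that chunks.json exists, since one file without the other is not usable.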
Part B - Load
- Load the index: faiss.read_index("index.faiss").
- Load the chunks list from chunks.json.
- Run the same query as in Exercise 1 and confirm results are identical.
- Time the load. Compare it to the time it took to build from scratch.
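For the timing step, a small wrapper around time.perf_counter is usually enough. The sketch below is one way to do it; timed is a hypothetical helper name, and the usage lines in the comment assume faiss is installed and index.faiss already exists.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Example use in 02_persist.py:
#
#   import faiss
#   index, load_s = timed(faiss.read_index, "index.faiss")
#   print(f"load: {load_s:.4f} s")
```

Wrapping the Exercise 1 build step in the same helper gives a like-for-like comparison. To confirm the results are identical, comparing the id arrays returned by the two searches (for example with numpy.array_equal) is a reasonable check.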
Thinking questions
- You saved chunks as plain text in JSON. What metadata would you want to save alongside the text in a production system?
- faiss.write_index only saves vectors and index structure, not the original Document objects or their page numbers. Design a simple file format (JSON or otherwise) that would store everything you need to fully reconstruct the search results.