Exercise 2 - Similarity
Goal: Compute cosine similarity between embeddings and build intuition for what scores actually mean.
Background
In Exercise 1 you saw that an embedding is a 768-dimensional vector. Similarity between two embeddings is the cosine of the angle between their vectors - a number between -1 and 1.
The script normalises vectors before comparing them, so similarity is just a dot product:
similarity = np.dot(a, b) # only valid when both are L2-normalised
Assignment
Open 02_similarity.py and run it:
python exercises/02_similarity.py
It embeds 5 sentences and prints a similarity matrix. Study the output, then:
- Identify the pair with the highest similarity. Does the number match your intuition?
- Find two sentences that share words but are semantically different. Are their scores high or low? What does this tell you about how embeddings work versus keyword matching?
- Add two new sentences of your own - one that should be very similar to sentence [0], and one that should be very different. Run again and check the scores.
- What is the similarity of any sentence with itself? Why is that the maximum possible value?
Thinking questions
- Cosine similarity measures angle, not distance. Two vectors can point in the same direction but have very different magnitudes. Why does normalisation solve this?
- If you used a different embedding model, would the similarity scores between the same sentences be the same? Why or why not?