Exercise 2 - Split Text
Goal: Use RecursiveCharacterTextSplitter to chunk a document, inspect the output, and understand how overlap works.
Background
By default, RecursiveCharacterTextSplitter tries to split on paragraph breaks first (\n\n), then line breaks (\n), then spaces between words, and finally individual characters, falling back to a smaller boundary only when a piece is still over the chunk_size limit.
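The fallback idea can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name recursive_split is hypothetical, the separator list (paragraph break, line break, space, then single characters) is assumed to match the library's defaults, and the step where the real splitter merges small neighbouring pieces back together up to chunk_size is deliberately left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator; recurse only into oversized pieces.

    Simplified sketch: small pieces are NOT merged back together here.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if not part:
            continue  # skip empty strings produced by repeated separators
        if len(part) <= chunk_size:
            pieces.append(part)
        else:
            # Only this oversized piece falls back to the finer separators.
            pieces.extend(recursive_split(part, chunk_size, finer))
    return pieces


sample = (
    "Short paragraph.\n\n"
    "This second paragraph is much longer than the limit allows."
)
pieces = recursive_split(sample, chunk_size=30)
for piece in pieces:
    print(len(piece), repr(piece))
```

The first paragraph fits within the limit and is kept whole. The second does not, so it alone is broken down further, here all the way to words because it contains no line breaks.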
Overlap repeats the tail of each chunk at the start of the next, so text that straddles a chunk boundary keeps some of its surrounding context in both chunks.
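To see overlap in isolation, here is a minimal sketch using a plain fixed-size sliding window. It is not how RecursiveCharacterTextSplitter works internally (that splitter respects separators); the helper name fixed_window_chunks is hypothetical and exists only to make the repeated tail visible on a tiny input.

```python
def fixed_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
with_overlap = fixed_window_chunks(alphabet, chunk_size=10, chunk_overlap=3)
without_overlap = fixed_window_chunks(alphabet, chunk_size=10, chunk_overlap=0)

print(with_overlap)     # each chunk starts with the last 3 characters of the previous one
print(without_overlap)  # chunks simply follow one another
```

With an overlap of 3, every window advances by only 7 characters, which is why the same text needs more chunks than it does with an overlap of 0.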
Assignment
Open 02_split_text.py.
- Load a PDF from data/papers/ (reuse your loader from Exercise 1).
- Split the full document text using RecursiveCharacterTextSplitter with chunk_size=500 and chunk_overlap=100.
- Print the total number of chunks produced.
- Print the length (in characters) of each chunk.
- Print the first and last 100 characters of chunk[0] and chunk[1]. Confirm that chunk[1] starts with text that also appeared near the end of chunk[0] - that is the overlap.
Now change chunk_overlap to 0 and re-run. Does chunk[1] now start where chunk[0] ended, with no repetition?
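Comparing the two chunks by eye is error-prone, so a small helper can report exactly which text they share. The sketch below is an optional aid, not part of the required solution; shared_overlap is a hypothetical name, and the example strings are made up. With the real splitter the shared text may well be shorter than chunk_overlap, since the splitter works with whole pieces rather than exact character counts.

```python
def shared_overlap(prev_chunk, next_chunk):
    """Return the longest suffix of prev_chunk that is also a prefix of next_chunk."""
    max_len = min(len(prev_chunk), len(next_chunk))
    # Try the longest candidate first so the first match is the longest one.
    for size in range(max_len, 0, -1):
        candidate = next_chunk[:size]
        if prev_chunk.endswith(candidate):
            return candidate
    return ""  # the chunks share no text at the boundary


overlapping = shared_overlap("the quick brown fox", "brown fox jumps")
disjoint = shared_overlap("the quick brown", "fox jumps")

print(repr(overlapping), len(overlapping))
print(repr(disjoint), len(disjoint))
```

Calling it on chunk[0] and chunk[1] from your own script should return a non-empty string with chunk_overlap=100 and an empty string with chunk_overlap=0.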
Thinking questions
- If a single paragraph is longer than chunk_size, what does the splitter do?
- You set chunk_size=500 but some chunks might be shorter. Why?
- Overlap increases index size (more chunks, some redundant content). When is it worth the cost?
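For the last question it can help to put a rough number on the cost. The sketch below assumes idealized fixed-size windows, which is only an approximation: the real splitter breaks at separators, so its counts will differ. The function name estimate_chunk_count and the 10,000-character document length are illustrative assumptions.

```python
import math


def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Approximate chunk count for fixed-size windows with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters covered by each extra chunk
    return math.ceil((text_length - chunk_overlap) / step)


doc_length = 10_000  # assumed document size in characters
count_with_overlap = estimate_chunk_count(doc_length, chunk_size=500, chunk_overlap=100)
count_without_overlap = estimate_chunk_count(doc_length, chunk_size=500, chunk_overlap=0)

print(count_with_overlap, count_without_overlap)
```

Under these assumptions, an overlap of 100 on 500-character chunks means each chunk contributes only 400 new characters, so the index grows by roughly a quarter.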