Anthropic Prompt Caching¶
In this notebook, we demonstrate how to use Anthropic's prompt caching feature with LlamaIndex abstractions. Prompt caching is enabled by marking cache_control in the message request.
How Prompt Caching Works¶
When you send a request with prompt caching enabled:
- The system checks whether the prompt prefix has already been cached by a recent query
- If a cached version exists, it is used, reducing processing time and cost
- Otherwise, the full prompt is processed and its prefix is cached for later use
A few things to note:
A. Prompt caching is supported for the following models: Claude 4 Opus, Claude 4 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Haiku, and Claude 3 Opus.
B. Minimum cacheable prompt length:
1. 2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku
2. 1024 tokens for all other models
C. Shorter prompts cannot be cached, even if marked with cache_control (a pre-flight length check is sketched below).
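Given the minimums in (B), it can be worth counting tokens before marking a prompt for caching. The sketch below is illustrative and not part of LlamaIndex; it assumes a recent anthropic Python SDK that exposes client.messages.count_tokens, and is_cacheable / min_tokens are hypothetical names:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def is_cacheable(text, model="claude-3-5-sonnet-20240620", min_tokens=1024):
    # Count tokens server-side, then compare against the minimum cacheable
    # length (1024 for most models, 2048 for the Haiku models).
    count = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return count.input_tokens >= min_tokens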
Setup API Keys¶
import os
os.environ[
    "ANTHROPIC_API_KEY"
] = "sk-ant-..."  # replace with your Anthropic API key
Setup LLM¶
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(model="claude-3-5-sonnet-20240620")
Download Data¶
In this demonstration, we will use the text from Paul Graham's essay. We will cache the text and run a few queries over it.
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O './paul_graham_essay.txt'
--2024-12-14 18:39:03--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘./paul_graham_essay.txt’

./paul_graham_essay 100%[===================>]  73.28K  --.-KB/s    in 0.04s

2024-12-14 18:39:03 (1.62 MB/s) - ‘./paul_graham_essay.txt’ saved [75042/75042]
Load Data¶
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(
    input_files=["./paul_graham_essay.txt"],
).load_data()
document_text = documents[0].text
Prompt Caching¶
To enable prompt caching in LlamaIndex, simply use the CachePoint block: everything that comes before it in the request will be cached.
We can verify whether the text was cached by checking the following parameters:
cache_creation_input_tokens: Number of tokens written to the cache when creating a new entry.
cache_read_input_tokens: Number of tokens retrieved from the cache for this request.
input_tokens: Number of input tokens that were neither read from the cache nor used to create a cache entry.
from llama_index.core.llms import (
    ChatMessage,
    TextBlock,
    CachePoint,
    CacheControl,
)

messages = [
    ChatMessage(role="system", content="You are a helpful AI Assistant."),
    ChatMessage(
        role="user",
        content=[
            TextBlock(
                text=document_text,
                type="text",
            ),
            TextBlock(
                text="\n\nWhy did Paul Graham start YC?",
                type="text",
            ),
            CachePoint(cache_control=CacheControl(type="ephemeral")),
        ],
    ),
]

resp = llm.chat(messages)
Let's inspect the raw response.
resp.raw
{'id': 'msg_01PAaZDTjEqcZksFiiqYH42t', 'content': [TextBlock(text='Based on the essay, it seems Paul Graham started Y Combinator (YC) for a few key reasons:\n\n1. He had experience as a startup founder with Viaweb and wanted to help other founders avoid mistakes he had made.\n\n2. He had ideas about how venture capital could be improved, like making more smaller investments in younger technical founders.\n\n3. He was looking for something new to work on after selling Viaweb to Yahoo and trying painting for a while.\n\n4. He wanted to gain experience as an investor and thought funding a batch of startups at once would be a good way to do that.\n\n5. It started as a "Summer Founders Program" to give undergrads an alternative to summer internships, but quickly grew into something more serious.\n\n6. He saw an opportunity to scale startup funding by investing in batches of companies at once.\n\n7. He was excited by the potential to help create new startups and technologies.\n\n8. It allowed him to continue working with his friends/former colleagues Robert Morris and Trevor Blackwell.\n\n9. He had built an audience through his essays that provided deal flow for potential investments.\n\nSo in summary, it was a combination of wanting to help founders, improve venture capital, gain investing experience, work with friends, and leverage his existing audience/expertise in the startup world. The initial idea evolved quickly from a summer program into a new model for seed investing.', type='text')], 'model': 'claude-3-5-sonnet-20240620', 'role': 'assistant', 'stop_reason': 'end_turn', 'stop_sequence': None, 'type': 'message', 'usage': Usage(input_tokens=4, output_tokens=305, cache_creation_input_tokens=9, cache_read_input_tokens=17467)}
As you can see, since I have run this more than once, both cache_creation_input_tokens and cache_read_input_tokens are greater than zero, which indicates the text was cached correctly.
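If you want to check these fields without reading the whole raw payload, a small helper can pull them out. This is a minimal sketch; print_cache_usage is a hypothetical name, not part of LlamaIndex or the Anthropic SDK, and it assumes resp.raw is a dict shaped like the output above:
def print_cache_usage(raw):
    # `usage` may be a typed object or a plain dict depending on the
    # SDK version, so handle both access styles.
    usage = raw["usage"]
    for field in (
        "input_tokens",
        "cache_creation_input_tokens",
        "cache_read_input_tokens",
    ):
        value = (
            usage.get(field)
            if isinstance(usage, dict)
            else getattr(usage, field, None)
        )
        print(f"{field}: {value}")

print_cache_usage(resp.raw)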
Now, let's run another query over the same document. This time the document text should be retrieved from the cache, which will be reflected in cache_read_input_tokens.
messages = [
    ChatMessage(role="system", content="You are a helpful AI Assistant."),
    ChatMessage(
        role="user",
        content=[
            TextBlock(
                text=document_text,
                type="text",
            ),
            TextBlock(
                text="\n\nWhat did Paul Graham do growing up?",
                type="text",
            ),
            CachePoint(cache_control=CacheControl(type="ephemeral")),
        ],
    ),
]

resp = llm.chat(messages)
resp.raw
{'id': 'msg_011TQgbpBuBkZAJeatVVcqtp', 'content': [TextBlock(text='Based on the essay, here are some key things Paul Graham did growing up:\n\n1. As a teenager, he focused mainly on writing and programming outside of school. He tried writing short stories but says they were "awful".\n\n2. At age 13-14, he started programming on an IBM 1401 computer at his school district\'s data processing center. He used an early version of Fortran.\n\n3. In high school, he convinced his father to buy a TRS-80 microcomputer around 1980. He wrote simple games, a program to predict model rocket flight, and a word processor his father used.\n\n4. He went to college intending to study philosophy, but found it boring. He then decided to switch to studying artificial intelligence (AI).\n\n5. In college, he learned Lisp programming language, which expanded his concept of what programming could be. \n\n6. For his undergraduate thesis, he reverse-engineered SHRDLU, an early natural language processing program.\n\n7. He applied to grad schools for AI and ended up going to Harvard for graduate studies.\n\n8. In grad school, he realized AI as practiced then was not going to achieve true intelligence. He pivoted to focusing more on Lisp programming.\n\n9. He started writing a book about Lisp hacking while in grad school, which was eventually published in 1993 as "On Lisp".\n\nSo in summary, his early years were focused on writing, programming (especially Lisp), and studying AI, before he eventually moved on to other pursuits after grad school. The essay provides a detailed account of his intellectual development in these areas.', type='text')], 'model': 'claude-3-5-sonnet-20240620', 'role': 'assistant', 'stop_reason': 'end_turn', 'stop_sequence': None, 'type': 'message', 'usage': Usage(input_tokens=4, output_tokens=356, cache_creation_input_tokens=0, cache_read_input_tokens=17476)}
As you can see, the response was generated from the cached text, as indicated by cache_read_input_tokens.
With Anthropic, the default cache lifetime is 5 minutes. You can also create a longer-lived cache, e.g. 1 hour, by specifying the ttl parameter of CacheControl.
messages = [
    ChatMessage(role="system", content="You are a helpful AI Assistant."),
    ChatMessage(
        role="user",
        content=[
            TextBlock(
                text=document_text,
                type="text",
            ),
            TextBlock(
                text="\n\nWhat did Paul Graham do growing up?",
                type="text",
            ),
            CachePoint(
                cache_control=CacheControl(type="ephemeral", ttl="1h"),
            ),
        ],
    ),
]

resp = llm.chat(messages)
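As before, you can confirm the cache behavior by inspecting the usage fields in the raw response: a one-hour entry is written on the first call and read on subsequent calls within the TTL. Reusing the hypothetical helper defined earlier:
print_cache_usage(resp.raw)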