Discord Thread Management¶
This guide shows how to manage documents from a source that is constantly updating.
In this example, we have a directory where the #issues-and-help channel on the LlamaIndex Discord is dumped periodically. We want to make sure our index always has the latest data, without duplicating any messages.
Indexing Discord Data¶
Discord data is dumped as sequential messages. Every message has useful information such as timestamps, authors, and links to parent messages if the message is part of a thread.
The help channel on our Discord commonly uses threads when solving issues, so we will group all the messages into threads, and index each thread as its own document.
First, let's explore the data we are working with.
import os
print(os.listdir("./discord_dumps"))
['help_channel_dump_06_02_23.json', 'help_channel_dump_05_25_23.json']
As you can see, we have two dumps from two different dates. Let's pretend we only have the older dump to start with, and we want to make an index from that data.
First, let's explore the data a bit
import json
with open("./discord_dumps/help_channel_dump_05_25_23.json", "r") as f:
data = json.load(f)
print("JSON keys: ", data.keys(), "\n")
print("Message Count: ", len(data["messages"]), "\n")
print("Sample Message Keys: ", data["messages"][0].keys(), "\n")
print("First Message: ", data["messages"][0]["content"], "\n")
print("Last Message: ", data["messages"][-1]["content"])
JSON keys:  dict_keys(['guild', 'channel', 'dateRange', 'messages', 'messageCount'])
Message Count:  5087
Sample Message Keys:  dict_keys(['id', 'type', 'timestamp', 'timestampEdited', 'callEndedTimestamp', 'isPinned', 'content', 'author', 'attachments', 'embeds', 'stickers', 'reactions', 'mentions'])
First Message:  If you're running into any bugs, issues, or you have questions as to how to best use GPT Index, put those here! - If it's a bug, let's also track as a GH issue: https://github.com/jerryjliu/gpt_index/issues.
Last Message:  Hello there! How can I use llama_index with GPU?
For convenience, I have provided a script that will group these messages into threads. You can see the group_conversations.py script for more details. The output file will be a json list where each item is a Discord thread.
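The script's internals aren't essential for this guide, but a minimal sketch of the grouping idea might look like the following. This is hypothetical and simplified: the parent-message link mentioned earlier doesn't appear in the sample keys above, so the "reference"/"messageId" fields and the author "name" sub-key are assumptions about the dump format, and the real script handles more edge cases.
from collections import defaultdict
# Hypothetical sketch: bucket messages under the root message of their thread.
# Field names "reference", "messageId", and author "name" are assumptions.
def group_messages_into_threads(messages):
    threads = defaultdict(list)
    for msg in messages:
        ref = msg.get("reference")  # present when the message belongs to a thread
        root_id = ref["messageId"] if ref else msg["id"]
        threads[root_id].append(msg)
    conversations = []
    for root_id, msgs in threads.items():
        text = "\n".join(f"{m['author']['name']}:\n{m['content']}" for m in msgs)
        conversations.append(
            {"thread": text, "metadata": {"id": root_id, "timestamp": msgs[0]["timestamp"]}}
        )
    return conversations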
!python ./group_conversations.py ./discord_dumps/help_channel_dump_05_25_23.json
Done! Written to conversation_docs.json
with open("conversation_docs.json", "r") as f:
threads = json.load(f)
print("Thread keys: ", threads[0].keys(), "\n")
print(threads[0]["metadata"], "\n")
print(threads[0]["thread"], "\n")
Thread keys: dict_keys(['thread', 'metadata'])
{'timestamp': '2023-01-02T03:36:04.191+00:00', 'id': '1059314106907242566'}
arminta7:
Hello all! Thanks to GPT_Index I've managed to put together a script that queries my extensive personal note collection which is a local directory of about 20k markdown files. Some of which are very long. I work in this folder all day everyday, so there are frequent changes. Currently I would need to rerun the entire indexing (is that the correct term?) when I want to incorporate edits I've made.
So my question is... is there a way to schedule indexing to maybe once per day and only add information for files that have changed? Or even just manually run it but still only add edits? This would make a huge difference in saving time (I have to leave it running overnight for the entire directory) as well as cost 😬.
Excuse me if this is a dumb question, I'm not a programmer and am sort of muddling around figuring this out 🤓
Thank you for making this sort of project accessible to someone like me!
ragingWater_:
I had a similar problem which I solved the following way in another world:
- if you have a list of files, you want something which says that edits were made in the last day, possibly looking at the last_update_time of the file should help you.
- for decreasing the cost, I would suggest maybe doing a keyword extraction or summarization of your notes and generating an embedding for it. Take your NLP query and get the most similar file (cosine similarity by pinecone db should help, GPTIndex also has a faiss) this should help with your cost needs
Now, we have a list of threads that we can transform into documents and index!
Create the initial index¶
from llama_index.core import Document
# create document objects using doc_id's and dates from each thread
documents = []
for thread in threads:
    thread_text = thread["thread"]
    thread_id = thread["metadata"]["id"]
    timestamp = thread["metadata"]["timestamp"]
    documents.append(
        Document(text=thread_text, id_=thread_id, metadata={"date": timestamp})
    )
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
Let's double check what documents the index has actually ingested
print("ref_docs ingested: ", len(index.ref_doc_info))
print("number of input documents: ", len(documents))
ref_docs ingested:  767
number of input documents:  767
So far so good. Let's also check a specific thread to make sure the metadata worked, and to see how many nodes it was broken into
thread_id = threads[0]["metadata"]["id"]
print(index.ref_doc_info[thread_id])
RefDocInfo(node_ids=['0c530273-b6c3-4848-a760-fe73f5f8136e'], metadata={'date': '2023-01-02T03:36:04.191+00:00'})
Perfect! Our thread is rather short, so it was converted directly into a single node. Furthermore, we can see the date field was set correctly.
Next, let's back up the index so that we don't have to waste tokens indexing again.
# save the initial index
index.storage_context.persist(persist_dir="./storage")
# load it again to confirm it worked
from llama_index.core import StorageContext, load_index_from_storage
index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./storage")
)
print("Double check ref_docs ingested: ", len(index.ref_doc_info))
Double check ref_docs ingested: 767
Refresh the index with new data!¶
Now, we suddenly remember we have that new dump of Discord messages! Rather than rebuilding the entire index from scratch, we can index only the new documents using the refresh() function.
Since we manually set the doc_id of each document, LlamaIndex can compare incoming documents with the same doc_id to confirm a) if the doc_id has actually been ingested and b) if the content has changed.
The refresh function will return a boolean array, indicating which documents in the input were refreshed or inserted. We can use this to confirm that only the new Discord threads are added!
When a document's content has changed, the update() function is called, which removes and re-inserts the document from the index.
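Conceptually, the per-document decision looks something like the sketch below. This is a simplified illustration, not the library's exact implementation, though get_document_hash(), insert(), and update_ref_doc() are real public APIs.
# Simplified sketch of what refresh() decides for each incoming document.
for doc in new_documents:
    existing_hash = index.docstore.get_document_hash(doc.doc_id)
    if existing_hash is None:
        index.insert(doc)  # never-seen doc_id -> insert as a new document
    elif existing_hash != doc.hash:
        index.update_ref_doc(doc)  # same doc_id, changed content -> delete + re-insert
    # otherwise: content unchanged, nothing to do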
import json
with open("./discord_dumps/help_channel_dump_06_02_23.json", "r") as f:
data = json.load(f)
print("JSON keys: ", data.keys(), "\n")
print("Message Count: ", len(data["messages"]), "\n")
print("Sample Message Keys: ", data["messages"][0].keys(), "\n")
print("First Message: ", data["messages"][0]["content"], "\n")
print("Last Message: ", data["messages"][-1]["content"])
JSON keys:  dict_keys(['guild', 'channel', 'dateRange', 'messages', 'messageCount'])
Message Count:  5286
Sample Message Keys:  dict_keys(['id', 'type', 'timestamp', 'timestampEdited', 'callEndedTimestamp', 'isPinned', 'content', 'author', 'attachments', 'embeds', 'stickers', 'reactions', 'mentions'])
First Message:  If you're running into any bugs, issues, or you have questions as to how to best use GPT Index, put those here! - If it's a bug, let's also track as a GH issue: https://github.com/jerryjliu/gpt_index/issues.
Last Message:  Started a thread.
As we can see, the first message is the same as in the original dump. But now we have around 200 more messages, and the last message is clearly new! refresh() will make updating our index easy.
First, let's create the new threads/documents
!python ./group_conversations.py ./discord_dumps/help_channel_dump_06_02_23.json
Done! Written to conversation_docs.json
with open("conversation_docs.json", "r") as f:
threads = json.load(f)
print("Thread keys: ", threads[0].keys(), "\n")
print(threads[0]["metadata"], "\n")
print(threads[0]["thread"], "\n")
Thread keys: dict_keys(['thread', 'metadata'])
{'timestamp': '2023-01-02T03:36:04.191+00:00', 'id': '1059314106907242566'}
arminta7:
Hello all! Thanks to GPT_Index I've managed to put together a script that queries my extensive personal note collection which is a local directory of about 20k markdown files. Some of which are very long. I work in this folder all day everyday, so there are frequent changes. Currently I would need to rerun the entire indexing (is that the correct term?) when I want to incorporate edits I've made.
So my question is... is there a way to schedule indexing to maybe once per day and only add information for files that have changed? Or even just manually run it but still only add edits? This would make a huge difference in saving time (I have to leave it running overnight for the entire directory) as well as cost 😬.
Excuse me if this is a dumb question, I'm not a programmer and am sort of muddling around figuring this out 🤓
Thank you for making this sort of project accessible to someone like me!
ragingWater_:
I had a similar problem which I solved the following way in another world:
- if you have a list of files, you want something which says that edits were made in the last day, possibly looking at the last_update_time of the file should help you.
- for decreasing the cost, I would suggest maybe doing a keyword extraction or summarization of your notes and generating an embedding for it. Take your NLP query and get the most similar file (cosine similarity by pinecone db should help, GPTIndex also has a faiss) this should help with your cost needs
# create document objects using doc_id's and dates from each thread
new_documents = []
for thread in threads:
    thread_text = thread["thread"]
    thread_id = thread["metadata"]["id"]
    timestamp = thread["metadata"]["timestamp"]
    new_documents.append(
        Document(text=thread_text, id_=thread_id, metadata={"date": timestamp})
    )
print("Number of new documents: ", len(new_documents) - len(documents))
Number of new documents: 13
# now, refresh!
refreshed_docs = index.refresh(
    new_documents,
    update_kwargs={"delete_kwargs": {"delete_from_docstore": True}},
)
By default, if a document's content has changed and it is updated, we can pass an extra flag, delete_from_docstore. This flag is False by default because multiple indexes can share the same docstore. But since we only have one index here, removing old documents from the docstore is fine.
If we kept this option as False, the document information would still be removed from the index_struct, which effectively makes the old document invisible to the index anyway.
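So if other indexes did share this docstore, the safer call would simply omit the extra kwargs, as in this sketch:
# If other indexes share the docstore, leave the underlying documents in place
# (delete_from_docstore defaults to False).
refreshed_docs = index.refresh(new_documents)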
print("Number of newly inserted/refreshed docs: ", sum(refreshed_docs))
Number of newly inserted/refreshed docs: 15
Interesting, we have 13 new documents, but 15 documents were refreshed. Did someone edit their message? Add more text to a thread? Let's find out
print(refreshed_docs[-25:])
[False, True, False, False, True, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True]
new_documents[-21]
Document(id_='1110938122902048809', embedding=None, weight=1.0, metadata={'date': '2023-05-24T14:31:28.732+00:00'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='36d308d1d2d1aa5cbfdb2f7d64709644a68805ec22a6053943f985084eec340e', text='Siddhant Saurabh:\nhey facing error\n```\n*error_trace: Traceback (most recent call last):\n File "/app/src/chatbot/query_gpt.py", line 248, in get_answer\n context_answer = self.call_pinecone_index(request)\n File "/app/src/chatbot/query_gpt.py", line 229, in call_pinecone_index\n self.source.append(format_cited_source(source_node.doc_id))\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 172, in doc_id\n return self.node.ref_doc_id\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 87, in ref_doc_id\n return self.relationships.get(DocumentRelationship.SOURCE, None)\nAttributeError: \'Field\' object has no attribute \'get\'\n```\nwith latest llama_index 0.6.9\n@Logan M @jerryjliu98 @ravitheja\nLogan M:\nHow are you inserting nodes/documents? That attribute on the node should be set automatically usually\nSiddhant Saurabh:\nI think this happened because of the error mentioned by me here https://discord.com/channels/1059199217496772688/1106229492369850468/1108453477081948280\nI think we need to re-preprocessing for such nodes, right?\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
documents[-8]
Document(id_='1110938122902048809', embedding=None, weight=1.0, metadata={'date': '2023-05-24T14:31:28.732+00:00'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='c995c43873440a9d0263de70fff664269ec70d751c6e8245b290882ec5b656a1', text='Siddhant Saurabh:\nhey facing error\n```\n*error_trace: Traceback (most recent call last):\n File "/app/src/chatbot/query_gpt.py", line 248, in get_answer\n context_answer = self.call_pinecone_index(request)\n File "/app/src/chatbot/query_gpt.py", line 229, in call_pinecone_index\n self.source.append(format_cited_source(source_node.doc_id))\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 172, in doc_id\n return self.node.ref_doc_id\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 87, in ref_doc_id\n return self.relationships.get(DocumentRelationship.SOURCE, None)\nAttributeError: \'Field\' object has no attribute \'get\'\n```\nwith latest llama_index 0.6.9\n@Logan M @jerryjliu98 @ravitheja\nLogan M:\nHow are you inserting nodes/documents? That attribute on the node should be set automatically usually\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
Nice! The newer documents contained threads with more messages. As you can see, refresh() was able to detect this and automatically replaced the older thread with the updated text.
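As a final step (not run above, but consistent with the backup we made earlier), you would likely want to persist the refreshed index again so the new threads survive a restart:
# overwrite the saved index with the refreshed version
index.storage_context.persist(persist_dir="./storage")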