邮件数据提取¶
OpenAI 函数可用于从电子邮件中提取数据。这是利用 LLamaIndex 从非结构化内容中获取结构化数据的另一个示例。
本示例的主要目标是将原始邮件内容转换为易于理解的 JSON 格式,展示语言模型在数据提取中的实际应用。提取出的结构化 JSON 数据可用于任何下游应用程序。
我们将使用下图所示的示例邮件。该邮件模拟了 ARK Investment 向其订阅者发送的典型日常通讯。这封示例邮件包含其交易所交易基金(ETFs)下的详细交易信息。通过这个具体示例,我们旨在展示如何有效地从真实场景的邮件中提取并结构化复杂的金融数据,将其转换为易于理解的 JSON 格式。
In [ ]:
Copied!
%pip install llama-index-llms-openai
%pip install llama-index-readers-file
%pip install llama-index-program-openai
%pip install llama-index-llms-openai
%pip install llama-index-readers-file
%pip install llama-index-program-openai
In [ ]:
Copied!
# LlamaIndex
!pip install llama-index
# To get text conents from .eml and .msg file
!pip install "unstructured[msg]"
# LlamaIndex
!pip install llama-index
# To get text conents from .eml and .msg file
!pip install "unstructured[msg]"
启用日志记录并设置 OpenAI API 密钥¶
在此步骤中,我们将设置日志记录功能以监控程序运行状态并在需要时进行调试。同时配置 OpenAI API 密钥,这是使用 OpenAI 服务的关键凭证。请将 "YOUR_KEY_HERE" 替换为您实际的 OpenAI API 密钥。
In [ ]:
Copied!
import logging
import sys, json
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
import logging
import sys, json
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
In [ ]:
Copied!
import os
import openai
# os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"
openai.api_key = os.environ["OPENAI_API_KEY"]
import os
import openai
# os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"
openai.api_key = os.environ["OPENAI_API_KEY"]
设置预期的 JSON 输出定义(JSON Schema)¶
此处我们使用 Pydantic 库定义一个名为 EmailData 的 Python 类。该类模拟了我们预期从电子邮件中提取的数据结构,包括发件人、收件人、邮件日期和时间,以及包含该 ETF 下交易股票列表的 ETF 集合。
In [ ]:
Copied!
from pydantic import BaseModel, Field
from typing import List
class Instrument(BaseModel):
"""Datamodel for ticker trading details."""
direction: str = Field(description="ticker trading - Buy, Sell, Hold etc")
ticker: str = Field(
description="Stock Ticker. 1-4 character code. Example: AAPL, TSLS, MSFT, VZ"
)
company_name: str = Field(
description="Company name corresponding to ticker"
)
shares_traded: float = Field(description="Number of shares traded")
percent_of_etf: float = Field(description="Percentage of ETF")
class Etf(BaseModel):
"""ETF trading data model"""
etf_ticker: str = Field(
description="ETF Ticker code. Example: ARKK, FSPTX"
)
trade_date: str = Field(description="Date of trading")
stocks: List[Instrument] = Field(
description="List of instruments or shares traded under this etf"
)
class EmailData(BaseModel):
"""Data model for email extracted information."""
etfs: List[Etf] = Field(
description="List of ETFs described in email having list of shares traded under it"
)
trade_notification_date: str = Field(
description="Date of trade notification"
)
sender_email_id: str = Field(description="Email Id of the email sender.")
email_date_time: str = Field(description="Date and time of email")
from pydantic import BaseModel, Field
from typing import List
class Instrument(BaseModel):
"""Datamodel for ticker trading details."""
direction: str = Field(description="ticker trading - Buy, Sell, Hold etc")
ticker: str = Field(
description="Stock Ticker. 1-4 character code. Example: AAPL, TSLS, MSFT, VZ"
)
company_name: str = Field(
description="Company name corresponding to ticker"
)
shares_traded: float = Field(description="Number of shares traded")
percent_of_etf: float = Field(description="Percentage of ETF")
class Etf(BaseModel):
"""ETF trading data model"""
etf_ticker: str = Field(
description="ETF Ticker code. Example: ARKK, FSPTX"
)
trade_date: str = Field(description="Date of trading")
stocks: List[Instrument] = Field(
description="List of instruments or shares traded under this etf"
)
class EmailData(BaseModel):
"""Data model for email extracted information."""
etfs: List[Etf] = Field(
description="List of ETFs described in email having list of shares traded under it"
)
trade_notification_date: str = Field(
description="Date of trade notification"
)
sender_email_id: str = Field(description="Email Id of the email sender.")
email_date_time: str = Field(description="Date and time of email")
从 .eml / .msg 文件加载内容¶
在此步骤中,我们将使用 llama-hub 中的 UnstructuredReader 来加载 .eml 电子邮件文件或 .msg Outlook 文件的内容。该文件内容随后会被存储到一个变量中以供进一步处理。
In [ ]:
Copied!
# get donload_loader
from llama_index.core import download_loader
# get donload_loader
from llama_index.core import download_loader
In [ ]:
Copied!
# Create a download loader
from llama_index.readers.file import UnstructuredReader
# Initialize the UnstructuredReader
loader = UnstructuredReader()
# For eml file
eml_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.eml")
email_content = eml_documents[0].text
print("\n\n Email contents")
print(email_content)
# Create a download loader
from llama_index.readers.file import UnstructuredReader
# Initialize the UnstructuredReader
loader = UnstructuredReader()
# For eml file
eml_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.eml")
email_content = eml_documents[0].text
print("\n\n Email contents")
print(email_content)
In [ ]:
Copied!
# For Outlook msg
msg_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.msg")
msg_content = msg_documents[0].text
print("\n\n Outlook contents")
print(msg_content)
# For Outlook msg
msg_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.msg")
msg_content = msg_documents[0].text
print("\n\n Outlook contents")
print(msg_content)
使用 LLM 函数以 JSON 格式提取内容¶
在最后一步中,我们利用 llama_index 包创建一个提示模板,用于从已加载的电子邮件中提取关键信息。通过实例化 OpenAI 模型来解析邮件内容,并根据预定义的 EmailData 模式提取相关信息。最终输出结果会被转换为字典格式,便于查看和处理。
In [ ]:
Copied!
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI
In [ ]:
Copied!
prompt = ChatPromptTemplate(
message_templates=[
ChatMessage(
role="system",
content=(
"You are an expert assitant for extracting insights from email in JSON format. \n"
"You extract data and returns it in JSON format, according to provided JSON schema, from given email message. \n"
"REMEMBER to return extracted data only from provided email message."
),
),
ChatMessage(
role="user",
content=(
"Email Message: \n" "------\n" "{email_msg_content}\n" "------"
),
),
]
)
llm = OpenAI(model="gpt-3.5-turbo-1106")
program = OpenAIPydanticProgram.from_defaults(
output_cls=EmailData,
llm=llm,
prompt=prompt,
verbose=True,
)
prompt = ChatPromptTemplate(
message_templates=[
ChatMessage(
role="system",
content=(
"You are an expert assitant for extracting insights from email in JSON format. \n"
"You extract data and returns it in JSON format, according to provided JSON schema, from given email message. \n"
"REMEMBER to return extracted data only from provided email message."
),
),
ChatMessage(
role="user",
content=(
"Email Message: \n" "------\n" "{email_msg_content}\n" "------"
),
),
]
)
llm = OpenAI(model="gpt-3.5-turbo-1106")
program = OpenAIPydanticProgram.from_defaults(
output_cls=EmailData,
llm=llm,
prompt=prompt,
verbose=True,
)
In [ ]:
Copied!
output = program(email_msg_content=email_content)
print("Output JSON From .eml File: ")
print(json.dumps(output.dict(), indent=2))
output = program(email_msg_content=email_content)
print("Output JSON From .eml File: ")
print(json.dumps(output.dict(), indent=2))
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Function call: EmailData with args: {"etfs":[{"etf_ticker":"ARKK","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":93654,"percent_of_etf":0.2453},{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":159506,"percent_of_etf":0.0907},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":86268,"percent_of_etf":0.0669},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":289619,"percent_of_etf":0.0391},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":927,"percent_of_etf":0.0001},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":100766,"percent_of_etf":0.0829},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":108523,"percent_of_etf":0.0957},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":302096,"percent_of_etf":0.0958},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":553172,"percent_of_etf":0.1476}],"trade_date":"1/12/2024"},{"etf_ticker":"ARKW","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":18148,"percent_of_etf":0.2454},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":49,"percent_of_etf":0.0000},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":9756,"percent_of_etf":0.016},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":21849,"percent_of_etf":0.0994},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":105944,"percent_of_etf":0.1459}],"trade_date":"1/12/2024"},{"etf_ticker":"ARKG","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":38042,"percent_of_etf":0.0864},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":21197,"percent_of_etf":0.0656},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":67422,"percent_of_etf":0.0363},{"direction":"Buy","ticker":"RPTX","company_name":"REPARE THERAPEUTICS INC","shares_traded":15410,"percent_of_etf":0.0049},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":32057,"percent_of_etf":0.1052}],"trade_date":"1/12/2024"}],"trade_notification_date":"1/12/2024","sender_email_id":"ark@ark-funds.com","email_date_time":"1/12/2024"}
Output JSON From .eml File:
{
"etfs": [
{
"etf_ticker": "ARKK",
"trade_date": "1/12/2024",
"stocks": [
{
"direction": "Buy",
"ticker": "TSLA",
"company_name": "TESLA INC",
"shares_traded": 93654.0,
"percent_of_etf": 0.2453
},
{
"direction": "Buy",
"ticker": "TXG",
"company_name": "10X GENOMICS INC",
"shares_traded": 159506.0,
"percent_of_etf": 0.0907
},
{
"direction": "Buy",
"ticker": "CRSP",
"company_name": "CRISPR THERAPEUTICS AG",
"shares_traded": 86268.0,
"percent_of_etf": 0.0669
},
{
"direction": "Buy",
"ticker": "RXRX",
"company_name": "RECURSION PHARMACEUTICALS",
"shares_traded": 289619.0,
"percent_of_etf": 0.0391
},
{
"direction": "Sell",
"ticker": "HOOD",
"company_name": "ROBINHOOD MARKETS INC",
"shares_traded": 927.0,
"percent_of_etf": 0.0001
},
{
"direction": "Sell",
"ticker": "EXAS",
"company_name": "EXACT SCIENCES CORP",
"shares_traded": 100766.0,
"percent_of_etf": 0.0829
},
{
"direction": "Sell",
"ticker": "TWLO",
"company_name": "TWILIO INC",
"shares_traded": 108523.0,
"percent_of_etf": 0.0957
},
{
"direction": "Sell",
"ticker": "PD",
"company_name": "PAGERDUTY INC",
"shares_traded": 302096.0,
"percent_of_etf": 0.0958
},
{
"direction": "Sell",
"ticker": "PATH",
"company_name": "UIPATH INC",
"shares_traded": 553172.0,
"percent_of_etf": 0.1476
}
]
},
{
"etf_ticker": "ARKW",
"trade_date": "1/12/2024",
"stocks": [
{
"direction": "Buy",
"ticker": "TSLA",
"company_name": "TESLA INC",
"shares_traded": 18148.0,
"percent_of_etf": 0.2454
},
{
"direction": "Sell",
"ticker": "HOOD",
"company_name": "ROBINHOOD MARKETS INC",
"shares_traded": 49.0,
"percent_of_etf": 0.0
},
{
"direction": "Sell",
"ticker": "PD",
"company_name": "PAGERDUTY INC",
"shares_traded": 9756.0,
"percent_of_etf": 0.016
},
{
"direction": "Sell",
"ticker": "TWLO",
"company_name": "TWILIO INC",
"shares_traded": 21849.0,
"percent_of_etf": 0.0994
},
{
"direction": "Sell",
"ticker": "PATH",
"company_name": "UIPATH INC",
"shares_traded": 105944.0,
"percent_of_etf": 0.1459
}
]
},
{
"etf_ticker": "ARKG",
"trade_date": "1/12/2024",
"stocks": [
{
"direction": "Buy",
"ticker": "TXG",
"company_name": "10X GENOMICS INC",
"shares_traded": 38042.0,
"percent_of_etf": 0.0864
},
{
"direction": "Buy",
"ticker": "CRSP",
"company_name": "CRISPR THERAPEUTICS AG",
"shares_traded": 21197.0,
"percent_of_etf": 0.0656
},
{
"direction": "Buy",
"ticker": "RXRX",
"company_name": "RECURSION PHARMACEUTICALS",
"shares_traded": 67422.0,
"percent_of_etf": 0.0363
},
{
"direction": "Buy",
"ticker": "RPTX",
"company_name": "REPARE THERAPEUTICS INC",
"shares_traded": 15410.0,
"percent_of_etf": 0.0049
},
{
"direction": "Sell",
"ticker": "EXAS",
"company_name": "EXACT SCIENCES CORP",
"shares_traded": 32057.0,
"percent_of_etf": 0.1052
}
]
}
],
"trade_notification_date": "1/12/2024",
"sender_email_id": "ark@ark-funds.com",
"email_date_time": "1/12/2024"
}
针对 Outlook 消息¶
In [ ]:
Copied!
output = program(email_msg_content=msg_content)
print("Output JSON from .msg file: ")
print(json.dumps(output.dict(), indent=2))
output = program(email_msg_content=msg_content)
print("Output JSON from .msg file: ")
print(json.dumps(output.dict(), indent=2))
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Function call: EmailData with args: {"etfs":[{"etf_ticker":"ARKK","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":93654,"percent_of_etf":0.2453},{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":159506,"percent_of_etf":0.0907},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":86268,"percent_of_etf":0.0669},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":289619,"percent_of_etf":0.0391},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":927,"percent_of_etf":0.0001},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":100766,"percent_of_etf":0.0829},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":108523,"percent_of_etf":0.0957},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":302096,"percent_of_etf":0.0958},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":553172,"percent_of_etf":0.1476}]},{"etf_ticker":"ARKW","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":18148,"percent_of_etf":0.2454},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":49,"percent_of_etf":0.0000},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":9756,"percent_of_etf":0.0160},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":21849,"percent_of_etf":0.0994},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":105944,"percent_of_etf":0.1459}]},{"etf_ticker":"ARKG","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":38042,"percent_of_etf":0.0864},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":21197,"percent_of_etf":0.0656},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":67422,"percent_of_etf":0.0363},{"direction":"Buy","ticker":"RPTX","company_name":"REPARE THERAPEUTICS INC","shares_traded":15410,"percent_of_etf":0.0049},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":32057,"percent_of_etf":0.1052}]}],"trade_notification_date":"1/12/2024","sender_email_id":"ark-invest.com","email_date_time":"1/12/2024"}
Output JSON :
{
"etfs": [
{
"etf_ticker": "ARKK",
"trade_date": "1/12/2024",
"stocks": [
{
"direction": "Buy",
"ticker": "TSLA",
"company_name": "TESLA INC",
"shares_traded": 93654.0,
"percent_of_etf": 0.2453
},
{
"direction": "Buy",
"ticker": "TXG",
"company_name": "10X GENOMICS INC",
"shares_traded": 159506.0,
"percent_of_etf": 0.0907
},
{
"direction": "Buy",
"ticker": "CRSP",
"company_name": "CRISPR THERAPEUTICS AG",
"shares_traded": 86268.0,
"percent_of_etf": 0.0669
},
{
"direction": "Buy",
"ticker": "RXRX",
"company_name": "RECURSION PHARMACEUTICALS",
"shares_traded": 289619.0,
"percent_of_etf": 0.0391
},
{
"direction": "Sell",
"ticker": "HOOD",
"company_name": "ROBINHOOD MARKETS INC",
"shares_traded": 927.0,
"percent_of_etf": 0.0001
},
{
"direction": "Sell",
"ticker": "EXAS",
"company_name": "EXACT SCIENCES CORP",
"shares_traded": 100766.0,
"percent_of_etf": 0.0829
},
{
"direction": "Sell",
"ticker": "TWLO",
"company_name": "TWILIO INC",
"shares_traded": 108523.0,
"percent_of_etf": 0.0957
},
{
"direction": "Sell",
"ticker": "PD",
"company_name": "PAGERDUTY INC",
"shares_traded": 302096.0,
"percent_of_etf": 0.0958
},
{
"direction": "Sell",
"ticker": "PATH",
"company_name": "UIPATH INC",
"shares_traded": 553172.0,
"percent_of_etf": 0.1476
}
]
},
{
"etf_ticker": "ARKW",
"trade_date": "1/12/2024",
"stocks": [
{
"direction": "Buy",
"ticker": "TSLA",
"company_name": "TESLA INC",
"shares_traded": 18148.0,
"percent_of_etf": 0.2454
},
{
"direction": "Sell",
"ticker": "HOOD",
"company_name": "ROBINHOOD MARKETS INC",
"shares_traded": 49.0,
"percent_of_etf": 0.0
},
{
"direction": "Sell",
"ticker": "PD",
"company_name": "PAGERDUTY INC",
"shares_traded": 9756.0,
"percent_of_etf": 0.016
},
{
"direction": "Sell",
"ticker": "TWLO",
"company_name": "TWILIO INC",
"shares_traded": 21849.0,
"percent_of_etf": 0.0994
},
{
"direction": "Sell",
"ticker": "PATH",
"company_name": "UIPATH INC",
"shares_traded": 105944.0,
"percent_of_etf": 0.1459
}
]
},
{
"etf_ticker": "ARKG",
"trade_date": "1/12/2024",
"stocks": [
{
"direction": "Buy",
"ticker": "TXG",
"company_name": "10X GENOMICS INC",
"shares_traded": 38042.0,
"percent_of_etf": 0.0864
},
{
"direction": "Buy",
"ticker": "CRSP",
"company_name": "CRISPR THERAPEUTICS AG",
"shares_traded": 21197.0,
"percent_of_etf": 0.0656
},
{
"direction": "Buy",
"ticker": "RXRX",
"company_name": "RECURSION PHARMACEUTICALS",
"shares_traded": 67422.0,
"percent_of_etf": 0.0363
},
{
"direction": "Buy",
"ticker": "RPTX",
"company_name": "REPARE THERAPEUTICS INC",
"shares_traded": 15410.0,
"percent_of_etf": 0.0049
},
{
"direction": "Sell",
"ticker": "EXAS",
"company_name": "EXACT SCIENCES CORP",
"shares_traded": 32057.0,
"percent_of_etf": 0.1052
}
]
}
],
"trade_notification_date": "1/12/2024",
"sender_email_id": "ark-invest.com",
"email_date_time": "1/12/2024"
}