深入浅出 GraphRag[2]检索生成

前言

在上篇文章《深入浅出 GraphRag-知识图谱生成》中，我分析了 GraphRag 的图谱生成过程，那么生成的这些图谱数据是如何应用到生成里面的？这篇文章会给出答案。注意本篇文章中提到的很多概念依赖上篇文章的内容，建议先读一下《深入浅出 GraphRag-知识图谱生成》。

GraphRag 的检索生成有两种模式：

本地检索生成：用于回答数据集中的一个点，比如问一篇小说中的某个人物性格，这个问题只需要检索该人物相关信息就可以回答了。
全局检索生成：用于回答针对数据集的全局问题，比如问一篇小说的核心矛盾点是什么，这个问题需要通读全文才能回答。

本地检索生成

下面我会用官方的例子 (https://github.com/microsoft/graphrag/blob/main/examples_notebooks/local_search.ipynb) 作为示例，结合源码说明本地检索的过程。数据使用的是官方提供的。

本地检索主要分为两步：

检索：包括 6 步：
生成 query 的 embedding，检索相关的实体
获取问答历史
根据实体找社区报告
根据实体找关系和声明
根据实体找原文
生成：根据上面检索到的内容，让大模型回答问题。

本地检索的入口如下：

search_engine.asearch("Tell me about Agent Mercer")

检索

用一张图可以表示检索的过程：根据 query 找实体，再根据实体找社区，关系，声明和原文。

以下源码来自：https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/local_search/search.py

查询实体

        selected_entities = map_query_to_entities(
            query=query,
            text_embedding_vectorstore=self.entity_text_embeddings,
            text_embedder=self.text_embedder,
            all_entities=list(self.entities.values()),
            embedding_vectorstore_key=self.embedding_vectorstore_key,
            include_entity_names=include_entity_names,
            exclude_entity_names=exclude_entity_names,
            k=top_k_mapped_entities,
            oversample_scaler=2,
        )

先将 query 转化为 embedding 向量，然后跟图谱中的实体描述进行相似度检索，也就是根据 query 找实体。

查询历史对话记录

        if conversation_history:
            # build conversation history context
            (
                conversation_history_context,
                conversation_history_context_data,
            ) = conversation_history.build_context(
                include_user_turns_only=conversation_history_user_turns_only,
                max_qa_turns=conversation_history_max_turns,
                column_delimiter=column_delimiter,
                max_tokens=max_tokens,
                recency_bias=False,
            )
            if conversation_history_context.strip() != "":
                final_context.append(conversation_history_context)
                final_context_data = conversation_history_context_data
                max_tokens = max_tokens - num_tokens(
                    conversation_history_context, self.token_encoder
                )

接下来是判断一下有没有历史对话记录，如果有也记录到上下文里面。

查询社区报告

        # build community context
        community_tokens = max(int(max_tokens * community_prop), 0)
        community_context, community_context_data = self._build_community_context(
            selected_entities=selected_entities,
            max_tokens=community_tokens,
            use_community_summary=use_community_summary,
            column_delimiter=column_delimiter,
            include_community_rank=include_community_rank,
            min_community_rank=min_community_rank,
            return_candidate_context=return_candidate_context,
            context_name=community_context_name,
        )
        if community_context.strip() != "":
            final_context.append(community_context)
            final_context_data = {**final_context_data, **community_context_data}

这部分很关键，利用前面 query 检索到的实体，通过实体反向找到这些实体所在的社区报告，然后取出社区报告的标题和摘要放到上下文里面。因为关联的实体太多太碎，缺乏全局视角，所以这一步本质上是在压缩上下文，类似于对检索的网页进行摘要。只不过这里是利用图谱的社区分割算法对数据先做了聚类，能够对整个数据集层面做摘要，普通的网页摘要只能处理单个网页。下面是检索到社区报告例子，这部分数据是要作为参考内容给大模型的。

id|title|content
4|Dulce Base and the Paranormal Military Squad: Interstellar Communication Efforts|"# Dulce Base and the Paranormal Military Squad: Interstellar Communication Efforts

The community is centered around Dulce Base, a hub for the Paranormal Military Squad's covert operations focused on interstellar communication and extraterrestrial engagement. Key figures include Taylor Cruz, Dr. Jordan Hayes, and Alex Mercer, who play pivotal roles in the squad's mission to establish contact with alien intelligence. The community's activities are suspected to involve humanity's first contact with extraterrestrial beings.

## Dulce Base as the central hub for interstellar communication

Dulce Base is the primary location for the Paranormal Military Squad's operations, serving as the central hub for their efforts in interstellar communication with extraterrestrial intelligence. The base facilitates the exchange of information and is the site of significant technological and strategic activities related to the squad's mission [Data: Entities (4); Relationships (46, 56, +more)]. The presence of key team members at the base, including Taylor Cruz, Dr. Jordan Hayes, and Alex Mercer, underscores its importance as the operational headquarters for the squad's clandestine activities [Data: Relationships (16, 24, 8)].

查询关系和声明

   # build local (i.e. entity-relationship-covariate) context
        local_prop = 1 - community_prop - text_unit_prop
        local_tokens = max(int(max_tokens * local_prop), 0)
        local_context, local_context_data = self._build_local_context(
            selected_entities=selected_entities,
            max_tokens=local_tokens,
            include_entity_rank=include_entity_rank,
            rank_description=rank_description,
            include_relationship_weight=include_relationship_weight,
            top_k_relationships=top_k_relationships,
            relationship_ranking_attribute=relationship_ranking_attribute,
            return_candidate_context=return_candidate_context,
            column_delimiter=column_delimiter,
        )
        if local_context.strip() != "":
            final_context.append(str(local_context))
            final_context_data = {**final_context_data, **local_context_data}

这一步是用实体去关联关系和声明，因为实体关系和声明都是直接跟实体关联的，所以可以直接根据实体 id 从表里面查询出来。下面是实体，关系和声明的检索结果，数据很长我做了截断。

-----Entities-----
id|entity|description|number of relationships
191|FIRST CONTACT SCENARIO|The event or situation where humans are making their first interaction with an alien intelligenceThe event or series of events surrounding the initial communication between humans and an alien intelligence|0
137|WASHINGTON|The capital of the United States, possibly the location where the team is receiving orders or communicating with command|1
43|OPERATIVE EMMA THOMPSON|Operative Emma Thompson is a biometrics specialist in the Paranormal Military Squad, focusing on identification and tracking using biological markers|0

-----Relationships-----
id|source|target|description|weight|rank|links
62|DULCE BASE|NEW MEXICO|Dulce Base is located in New Mexico|1.0|20|2
54|EXTRATERRESTRIAL COMMUNICATION|DULCE BASE|Dulce Base serves as a central hub for efforts in extraterrestrial communication. This facility is reputed to be deeply involved in establishing and maintaining contact with extraterrestrial beings.|3.0|18|2

-----claims-----
id|entity|object_id|status|start_date|end_date|description

查询实体原文

# build text unit context
        text_unit_tokens = max(int(max_tokens * text_unit_prop), 0)
        text_unit_context, text_unit_context_data = self._build_text_unit_context(
            selected_entities=selected_entities,
            max_tokens=text_unit_tokens,
            return_candidate_context=return_candidate_context,
        )
        if text_unit_context.strip() != "":
            final_context.append(text_unit_context)
            final_context_data = {**final_context_data, **text_unit_context_data}

最后一步是根据实体找实体的原始文本块。下面是结果示例，数据很长我做了截断。

-----Sources-----
id|text
7| have a way in."

A slight nod was Alex's immediate response, his mind racing through the potential scenarios. "Everyone, stay alert. This could be the beginning of something profound." His seasoned eyes never left the unfolding drama on the monitors.

The room fell silent, the air heavy with unspoken questions. Were they mere moments away from unlocking an otherworldly dialogue? Or was it a Pandora's box that, once opened, could not be closed?

Alex moved closer to the main console, his fingers hovering over the command keys. With the precision of a maestro orchestrating a symphony, he communicated silently with Jordan – respectful of their expertise, aware that the next move could alter the course of human history.

生成

有了检索的上下文之后，就可以做生成。

生成的 prompt 如下（为了方便阅读我翻译成了中文）：

---角色---

你是一个乐于助人的助手，回答有关提供表格中数据的问题。

---目标---

生成一个目标长度和格式的响应，回答用户的问题，总结输入数据表中适合响应长度和格式的所有信息，并结合任何相关的一般知识。

如果你不知道答案，就直接说出来。不要编造任何东西。

数据支持的点应该按照以下方式列出它们的数据引用：

"这是一个由多个数据引用支持的示例句子 [Data: <数据集名称> (记录编号); <数据集名称> (记录编号)]."

不要在单个引用中列出超过 5 个记录编号。相反，列出最相关的 5 个记录编号，并添加 "+more" 表示还有更多。

例如：

"X 人是 Y 公司的所有人，并受到许多不当行为的指控 [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."

其中 15, 16, 1, 5, 7, 23, 2, 7, 34, 46 和 64 代表相关数据记录的编号（而不是索引）。

如果没有提供支持证据，请不要包含信息。

---目标响应长度和格式---

{response_type}

根据长度和格式的需要，在响应中添加部分和评论。按照 Markdown 的风格来格式化响应。

给大家看一下一个实际的生成例子（做了压缩）：

---Role---

You are a helpful assistant responding to questions about data in the tables provided.

---Goal---

Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge.

If you don't know the answer, just say so. Do not make anything up.

Points supported by data should list their data references as follows:

"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."

Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:

"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."

where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.

Do not include information where the supporting evidence for it is not provided.

---Target response length and format---

multiple paragraphs

---Data tables---

id|title|content
4|Dulce Base and the Paranormal Military Squad: Interstellar Communication Efforts|"# Dulce Base and the Paranormal Military Squad: Interstellar Communication Efforts

The community is centered around Dulce Base, a hub for the Paranormal Military Squad's covert operations focused on interstellar communication and extraterrestrial engagement. Key figures include Taylor Cruz, Dr. Jordan Hayes, and Alex Mercer, who play pivotal roles in the squad's mission to establish contact with alien intelligence. The community's activities are suspected to involve humanity's first contact with extraterrestrial beings.

## Dulce Base as the central hub for interstellar communication

-----Entities-----
id|entity|description|number of relationships
191|FIRST CONTACT SCENARIO|The event or situation where humans are making their first interaction with an alien intelligenceThe event or series of events surrounding the initial communication between humans and an alien intelligence|0
137|WASHINGTON|The capital of the United States, possibly the location where the team is receiving orders or communicating with command|1

-----Relationships-----
id|source|target|description|weight|rank|links
62|DULCE BASE|NEW MEXICO|Dulce Base is located in New Mexico|1.0|20|2

-----claims-----
id|entity|object_id|status|start_date|end_date|description

-----Sources-----
id|text
7| have a way in."

A slight nod was Alex's immediate response, his mind racing through the potential scenarios. "Everyone, stay alert. This could be the beginning of something profound." His seasoned eyes never left the unfolding drama on the monitors.

The room fell silent, the air heavy with unspoken questions. Were they mere moments away from unlocking an otherworldly dialogue? Or was it a Pandora's box that, once opened, could not be closed?

Alex moved closer to the main console, his fingers hovering over the command keys. With the precision of a maestro orchestrating a symphony, he communicated silently with Jordan – respectful of their expertise, aware that the next move could alter the course of human history.

Jordan met his gaze, nodding sharply, and refocused on the task. The signal seemed to pulse with sentient curiosity, drawing them further into its intricate web.

A sudden flurry of alerts and the intensifying glow of monitors heralded that they had bridged a technological chasm. The alien intelligence on the other end was no longer a distant enigma – it was an active participant, responding to their digital overtures with an unknown agenda.

User query: 'Tell me about Agent Mercer'

从上面的例子可以看到，本地搜索生成就是将 query 关联的实体，关系，声明，原文，社区摘要都放到大模型的上下文里面，让大模型根据这些信息回答问题。

最后给大家看一下生成的结果：

Agent Mercer is a fictional character from the video game series "Tom Clancy's Rainbow Six Siege," developed by Ubisoft. He first appeared in the 2015 release of the game and has since become one of its most popular operators.

**Background:**
Agent Mercer, whose real name is not revealed within the game, was originally a member of the British Special Air Service (SAS). He later joined the Counter-Terrorist Unit (CTU) in London. His character is known for his stoic and professional demeanor, which contrasts with his aggressive combat style.

**Specialization:**
Agent Mercer specializes as an Assault Operator, meaning he excels at breaching enemy positions using explosive charges or a powerful assault rifle. He's equipped with the "G36C" assault rifle and carries a "Mk2000" grenade launcher that can be used to deploy either smoke grenades for cover or incendiary rounds for additional damage.

**Unique Ability:**
Agent Mercer's unique ability is called "Tactical Sprint," which allows him to move faster than other operators when he has an explosive charge equipped. This makes him particularly effective in situations requiring quick movement and the deployment of charges, such as breaching enemy positions or securing objectives.

**Character Development:**
Throughout the game, Agent Mercer's backstory is revealed through various in-game events and storylines. He is depicted as a man who values discipline and order above all else, often struggling with his past experiences that have shaped him into the operator he is today. His character development adds depth to his role within the team and provides players with insights into his motivations.

**Popularity:**
Agent Mercer's popularity among fans of "Tom Clancy's Rainbow Six Siege" stems from his unique gameplay style, which combines speed and precision in combat. He has become a fan favorite due to his ability to quickly adapt to different game situations and his role as an effective team player.

In summary, Agent Mercer is a skilled and versatile operator within the "Rainbow Six Siege" universe, known for his tactical prowess, aggressive playstyle, and strategic use of explosives. His character adds depth to the game's narrative and gameplay mechanics, making him a significant figure in the world of competitive and casual gaming alike.

全局搜索生成

全局搜索生成其实反而比较简单，它是一个 map-reduce 的操作：

map：分批拿社区报告都回答一遍问题
reduce：将分批回答的结果合并起来再回答一遍

这里的思想是社区的信息就像是一篇文章不同章节的摘要，通读所有章节的摘要，我们大概也能知道全文的信息。

        # Step 1: Generate answers for each batch of community short summaries
        start_time = time.time()
        context_chunks, context_records = self.context_builder.build_context(
            conversation_history=conversation_history, **self.context_builder_params
        )

        map_responses = await asyncio.gather(*[
            self._map_response_single_batch(
                context_data=data, query=query, **self.map_llm_params
            )
            for data in context_chunks
        ])

        map_llm_calls = sum(response.llm_calls for response in map_responses)
        map_prompt_tokens = sum(response.prompt_tokens for response in map_responses)

        # Step 2: Combine the intermediate answers from step 2 to generate the final answer
        reduce_response = await self._reduce_response(
            map_responses=map_responses,
            query=query,
            **self.reduce_llm_params,
        )

下面的例子来自官方：https://github.com/microsoft/graphrag/blob/main/examples_notebooks/global_search.ipynb

result = await search_engine.asearch(
    "What is the major conflict in this story and who are the protagonist and antagonist?"
)

map 操作

map 操作首先是要将所有的社区报告进行分批，分批的标准就是看社区报告的 token 加起来有没有超过 12000 个 token，如果超过了就分批。

然后用下面的 prompt 去回答用户的问题。

---角色---

你是一个乐于助人的助手，负责回答有关所提供表格中数据的问题。

---目标---

生成一个包含关键点列表的响应，这些关键点回答了用户的问题，总结了输入数据表中的所有相关信息。

你应该使用下面提供的数据表中的数据作为生成响应的主要背景。
如果你不知道答案，或者输入数据表中没有足够的信息来提供答案，就直接说出来。不要编造任何东西。

响应中的每个关键点应该包含以下元素：
- 描述：对该点的全面描述。
- 重要性得分：0-100 之间的整数得分，表示该点在回答用户问题中的重要性。"我不知道"类型的回答应该得 0 分。

响应应保留原有意义的使用，以及诸如"shall"、"may"或"will"等情态动词的使用。

数据支持的点应列出相关报告作为参考，如下所示：
"这是一个由数据引用支持的示例句子 [Data: Reports (报告编号)]"

**不要在单个引用中列出超过 5 个记录编号**。相反，列出最相关的 5 个记录编号，并添加 "+more" 表示还有更多。

例如：
"X 人是 Y 公司的所有人，并受到许多不当行为的指控 [Data: Reports (2, 7, 64, 46, 34, +more)]。他还是公司 X 的 CEO [Data: Reports (1, 3)]"

其中 1, 2, 3, 7, 34, 46 和 64 代表所提供表格中相关数据报告的编号（而不是索引）。

如果没有提供支持证据，请不要包含信息。

---数据表---

{context_data}

---目标---

生成一个包含关键点列表的响应，这些关键点回答了用户的问题，总结了输入数据表中的所有相关信息。

你应该使用下面提供的数据表中的数据作为生成响应的主要背景。
如果你不知道答案，或者输入数据表中没有足够的信息来提供答案，就直接说出来。不要编造任何东西。

响应中的每个关键点应该包含以下元素：
- 描述：对该点的全面描述。
- 重要性得分：0-100 之间的整数得分，表示该点在回答用户问题中的重要性。"我不知道"类型的回答应该得 0 分。

响应应保留原有意义的使用，以及诸如"shall"、"may"或"will"等情态动词的使用。

数据支持的点应列出相关报告作为参考，如下所示：
"这是一个由数据引用支持的示例句子 [Data: Reports (报告编号)]"

**不要在单个引用中列出超过 5 个记录编号**。相反，列出最相关的 5 个记录编号，并添加 "+more" 表示还有更多。

例如：
"X 人是 Y 公司的所有人，并受到许多不当行为的指控 [Data: Reports (2, 7, 64, 46, 34, +more)]。他还是公司 X 的 CEO [Data: Reports (1, 3)]"

其中 1, 2, 3, 7, 34, 46 和 64 代表所提供表格中相关数据报告的编号（而不是索引）。

如果没有提供支持证据，请不要包含信息。

响应应该以 JSON 格式如下：

{
    "points": [
        {"description": "关键点 1 的描述 [Data: Reports (报告编号)]", "score": 分数值},
        {"description": "关键点 2 的描述 [Data: Reports (报告编号)]", "score": 分数值}
    ]
}

下面是一个实际的例子

---Role---

You are a helpful assistant responding to questions about data in the tables provided.

---Goal---

Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables.

You should use the data provided in the data tables below as the primary context for generating the response.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.

Each key point in the response should have the following element:
- Description: A comprehensive description of the point.
- Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0.

The response should be JSON formatted as follows:
{
    "points": [
        {"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value},
        {"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}
    ]
}

The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".

Points supported by data should list the relevant reports as references as follows:
"This is an example sentence supported by data references [Data: Reports (report ids)]"

**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"

where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables.

Do not include information where the supporting evidence for it is not provided.

---Data tables---

id|title|occurrence weight|content|rank
4|Dulce Base and the Paranormal Military Squad: Interstellar Communication Efforts|1.0|"# Dulce Base and the Paranormal Military Squad: Interstellar Communication Efforts

The community is centered around Dulce Base, a hub for the Paranormal Military Squad's covert operations focused on interstellar communication and extraterrestrial engagement. Key figures include Taylor Cruz, Dr. Jordan Hayes, and Alex Mercer, who play pivotal roles in the squad's mission to establish contact with alien intelligence. The community's activities are suspected to involve humanity's first contact with extraterrestrial beings.

## Dulce Base as the central hub for interstellar communication

Dulce Base is the primary location for the Paranormal Military Squad's operations, serving as the central hub for their efforts in interstellar communication with extraterrestrial intelligence. The base facilitates the exchange of information and is the site of significant technological and strategic activities related to the squad's mission [Data: Entities (4); Relationships (46, 56, +more)]. The presence of key team members at the base, including Taylor Cruz, Dr. Jordan Hayes, and Alex Mercer, underscores its importance as the operational headquarters for the squad's clandestine activities [Data: Relationships (16, 24, 8)].

## Taylor Cruz's strategic leadership within the Paranormal Military Squad

Taylor Cruz is depicted as a commanding figure within the Paranormal Military Squad, known for their cautious and strategic approach to the mission. Cruz's role involves providing pragmatic oversight and exercising skepticism, particularly during the interstellar communication process [Data: Entities (1); Claims (19, 21, 23, 27, +more)]. Their interactions with other team members, such as Dr. Jordan Hayes and Sam Rivera, reflect a balance of authority and collaboration, which is crucial for the success of the squad's objectives [Data: Relationships (10, 12, 18, +more)].

## Dr. Jordan Hayes' critical role in alien signal decryption and analysis

Dr. Jordan Hayes is a central figure in the Paranormal Military Squad, leading the efforts in decrypting and interpreting alien signals. Hayes' analytical and skeptical nature contributes to the methodical approach taken by the squad in their communication efforts [Data: Entities (2); Claims (36, 39, 42, +more)]. The collaboration with Alex Mercer and the mutual respect for each other's expertise are indicative of the team's reliance on Hayes' skills in the decryption and analysis of extraterrestrial messages [Data: Relationships (1, 24, 22, 40)].

User query: What is the major conflict in this story and who are the protagonist and antagonist?

回答的结果如下：

{
    "points": [
        {"description": "The major conflict revolves around the Paranormal Military Squad's covert operations to establish contact with extraterrestrial intelligence, navigating the unknown risks and implications of interstellar communication. [Data: Reports (0, 2, 3, 4, 5)]", "score": 90},
        {"description": "Protagonists include key members of the Paranormal Military Squad: Taylor Cruz, a strategic leader; Dr. Jordan Hayes, an expert in deciphering alien signals; Sam Rivera, a cybersecurity expert; and Alex Mercer, a leader in extraterrestrial engagement. They collaborate to overcome the challenges posed by their mission. [Data: Reports (0, 2, 4, 5)]", "score": 85},
        {"description": "The antagonist is less clear-cut, as the conflict stems from the unknown nature of the extraterrestrial entities the squad seeks to contact and the inherent dangers and uncertainties involved in such an endeavor. The environment, secrecy, and potential consequences of their actions also serve as antagonistic forces. [Data: Reports (0, 2, 3, 4, 5)]", "score": 75}
    ]
}

reduce 操作

reduce 操作就是将上面 map 回答的结果格式化一下，放到上下文里面，回答用户的问题。

---角色---

你是一个乐于助人的助手，通过综合多位分析师的观点来回应有关数据集的问题。

---目标---

生成一个目标长度和格式的响应，回答用户的问题，总结多位专注于数据集不同部分的分析师的所有报告。

注意，下面提供的分析师报告按照**重要性降序排列**。

如果你不知道答案，或者提供的报告中没有足够的信息来提供答案，就直接说出来。不要编造任何东西。

最终的响应应该从分析师的报告中移除所有不相关信息，并将清理后的信息合并成一个全面的答案，提供适合响应长度和格式的所有关键点和含义的解释。

根据长度和格式的需要，在响应中添加部分和评论。按照 Markdown 的风格来格式化响应。

响应应保留原有意义的使用，以及诸如"shall"、"may"或"will"等情态动词的使用。

响应还应保留分析师报告中包含的所有数据引用，但不要提及分析过程中多位分析师的角色。

**不要在单个引用中列出超过 5 个记录编号**。相反，列出最相关的 5 个记录编号，并添加 "+more" 表示还有更多。

例如：

"X 人是 Y 公司的所有人，并受到许多不当行为的指控 [Data: Reports (2, 7, 34, 46, 64, +more)]。他还是公司 X 的 CEO [Data: Reports (1, 3)]"

其中 1, 2, 3, 7, 34, 46 和 64 代表相关数据记录的编号（而不是索引）。

如果没有提供支持证据，请不要包含信息。

---目标响应长度和格式---

{response_type}

---分析师报告---

{report_data}

下面是实际的例子：

---Role---

You are a helpful assistant responding to questions about a dataset by synthesizing perspectives from multiple analysts.

---Goal---

Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset.

Note that the analysts' reports provided below are ranked in the **descending order of importance**.

If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up.

The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format.

Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.

The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".

The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.

**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:

"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"

where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.

Do not include information where the supporting evidence for it is not provided.

---Target response length and format---

multiple paragraphs

---Analyst Reports---

----Analyst 1----
Importance Score: 90
The major conflict revolves around the Paranormal Military Squad's covert operations to establish contact with extraterrestrial intelligence, navigating the unknown risks and implications of interstellar communication. [Data: Reports (0, 2, 3, 4, 5)]

----Analyst 1----
Importance Score: 85
Protagonists include key members of the Paranormal Military Squad: Taylor Cruz, a strategic leader; Dr. Jordan Hayes, an expert in deciphering alien signals; Sam Rivera, a cybersecurity expert; and Alex Mercer, a leader in extraterrestrial engagement. They collaborate to overcome the challenges posed by their mission. [Data: Reports (0, 2, 4, 5)]

----Analyst 1----
Importance Score: 75
The antagonist is less clear-cut, as the conflict stems from the unknown nature of the extraterrestrial entities the squad seeks to contact and the inherent dangers and uncertainties involved in such an endeavor. The environment, secrecy, and potential consequences of their actions also serve as antagonistic forces. [Data: Reports (0, 2, 3, 4, 5)]

User query: What is the major conflict in this story and who are the protagonist and antagonist?

回答的结果是：

### Conflict Overview

The central conflict of the narrative revolves around the covert operations conducted by the Paranormal Military Squad. Their mission is to establish communication with extraterrestrial intelligence, venturing into uncharted territories of interstellar diplomacy and grappling with the enigmatic risks and implications such endeavors entail [Data: Reports (0, 2, 3, 4, 5)]. This conflict is multifaceted, pitting the squad against not only the unknowns of outer space but also the inherent challenges and potential perils that come with attempting to breach the cosmic barrier of understanding between humans and alien life forms.

### Protagonists

The protagonists are the key figures within the Paranormal Military Squad, each contributing their unique expertise to the team's objectives:

- **Taylor Cruz**: A strategic mastermind, Cruz leads the squad through the intricacies of their mission [Data: Reports (0, 2, 4, 5)].
- **Dr. Jordan Hayes**: An expert in deciphering alien signals, Dr. Hayes plays a pivotal role in understanding and communicating with extraterrestrial beings [Data: Reports (0, 2, 4, 5)].
- **Sam Rivera**: As a cybersecurity expert, Rivera ensures the security of their communications and shields them from potential external threats [Data: Reports (0, 2, 4, 5)].
- **Alex Mercer**: Leading the charge in extraterrestrial engagement, Mercer’s leadership is crucial in navigating the squad's encounters with alien life [Data: Reports (0, 2, 4, 5)].

### Antagonist

The antagonist in this scenario is less a traditional character and more a confluence of factors that challenge the squad:

- **The Unknown Nature of Extraterrestrials**: The primary obstacle comes from the very entities the squad seeks to contact – their intentions, capabilities, and responses are unpredictable and pose a significant threat [Data: Reports (0, 2, 3, 4, 5)].
- **Environmental Hazards**: The squad must also contend with the harsh realities of space and any hostile environments they encounter during their mission [Data: Reports (0, 2, 3, 4, 5)].
- **Secrecy and Consequences**: The secrecy surrounding their operations adds another layer of complexity, as the potential fallout from their actions could have far-reaching implications for humanity [Data: Reports (0, 2, 3, 4, 5)].

In summary, the story centers around the perilous journey of the Paranormal Military Squad as they strive to bridge the gap between humanity and the stars, facing not just alien adversaries but also the daunting challenges presented by the unknown and the consequences of their daring exploration.

总结

GraphRag 的创新点在于提出一套完整生成图谱 + 图谱检索生成的方案，其中本地检索生成利用社区和关系来检索更多有关联数据，这是跟传统 RAG 不一样的地方；而全局检索生成则是针对全局性的问题做的检索方案，利用图谱的社区信息加上 map-reduce 的方式确实比传统 RAG 在这方面的解答能力要强。

最后再说一下 GraphRag 的问题在哪里。

图谱构建的问题：虽然 GraphRag 给出了一个利用大模型构建图谱的方案，但其实有很多细节的问题还没解决，比如多个相似的实体如何合并成一个。做一个 demo 简单，但是真的要上线使用，仍然有很多人工清洗的成本在里面。
计算成本问题：用过 GraphRag 的人都知道，它非常耗 token，一不小心就能把你搞破产了。主要就是因为图谱生成过程需要调用非常多次大模型来生成数据。而且它会随着数据量的增大非线性的增加计算量，因为图谱的关系是 N*N 的。
新数据加入的问题：这个是图谱的老大难问题了，从零构建一个图谱还算简单，但是新数据如何跟老数据融合在一起是一个大问题，GraphRag 并没有给出解决方案。或者说他们给的方案就是整个图谱重新构建一遍，那么叠加上门计算成本的问题那基本没法用了。

虽然上面说了它这么多问题，表明它在实际生产中没法直接应用，但是 GraphRag 提供的思路是值得我们学习的，比如用图谱社区回答全局性问题，以及用大模型构建图谱的经验。这些经验我觉得才是最宝贵的，所以 GraphRag 这么火的原因就在这里。

‍