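Vector Databases

Overview of Vector Databases

A vector database is a database built to store and retrieve high-dimensional vectors. In a RAG system it stores the vector representations of documents and provides efficient similarity search over them.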

┌─────────────────────────────────────────────────────────────┐
│                 How a vector database works                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Text input            Store in the vector database         │
│     │                                                       │
│     ▼                                                       │
│  ┌─────────┐                                                │
│  │Embedding│  "AI is..." → [0.12, -0.34, 0.56...]           │
│  │  model  │                                                │
│  └─────────┘                                                │
│                                                             │
│  Query input           Similarity search                    │
│     │                                                       │
│     ▼                                                       │
│  ┌─────────┐                                                │
│  │Embedding│  "What is AI?" → [0.15, -0.30, 0.52...]        │
│  │  model  │                                                │
│  └─────────┘                                                │
│     │                                                       │
│     ▼                                                       │
│  ┌────────────┐                                             │
│  │ Similarity │─────────▶ Return the top-K results          │
│  │ scoring    │                                             │
│  └────────────┘                                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
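The store-then-search flow in the diagram can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: `TinyVectorStore` is a hypothetical class, the three-dimensional vectors are made-up numbers rather than real model output, and the search scans every stored vector, which a real vector database avoids by using an index.

```python
import math


class TinyVectorStore:
    """Toy in-memory store: keeps vectors in a dict and scans all of them."""

    def __init__(self):
        self._vectors = {}

    def add(self, doc_id, vector):
        """Store one embedding under its document id."""
        self._vectors[doc_id] = list(vector)

    def query(self, vector, k=2):
        """Return the k (doc_id, cosine_similarity) pairs closest to vector."""
        scored = [(doc_id, self._cosine(vector, stored))
                  for doc_id, stored in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)


store = TinyVectorStore()
store.add("doc_ai", [0.12, -0.34, 0.56])        # made-up embedding
store.add("doc_cooking", [-0.70, 0.20, 0.05])   # made-up embedding
store.add("doc_ml", [0.10, -0.20, 0.60])        # made-up embedding

results = store.query([0.15, -0.30, 0.52], k=2)
print(results)
```

The query vector points in nearly the same direction as `doc_ai`, so that document ranks first even though the two vectors are not identical.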

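Comparison of Popular Vector Databases

Database	Characteristics	Best suited for	Deployment
Milvus	Full-featured, high performance	Large-scale production	Cloud / self-hosted
Pinecone	Managed service, no operations work	Getting to production quickly	Cloud service
Chroma	Lightweight, easy to use	Prototyping, small scale	Local
Weaviate	Native multimodal support	Multimodal applications	Cloud / self-hosted
Qdrant	Written in Rust, high performance	Production	Cloud / self-hosted
FAISS	Open-source similarity-search library from Facebook (Meta)	Research, academic work	Local
Milvus Lite	Lightweight version of Milvus	Rapid prototyping	Local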

Chroma (Recommended for Getting Started)

Installation

bash

# Python
pip install chromadb

# Node.js
npm install chromadb

Basic Usage

python

import chromadb

# Create an in-memory client (data is lost when the process exits)
client = chromadb.Client()

# Create a collection
collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}  # distance metric: "cosine", "l2", or "ip"
)

# Add documents
collection.add(
    documents=[
        "Artificial intelligence is a branch of computer science",
        "Machine learning is a subfield of artificial intelligence",
        "Deep learning is a branch of machine learning"
    ],
    ids=["doc1", "doc2", "doc3"],
    metadatas=[
        {"source": "ai-intro", "category": "AI"},
        {"source": "ml-intro", "category": "ML"},
        {"source": "dl-intro", "category": "DL"}
    ]
)

# Query
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=2
)

print(results)
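The printed result is column-oriented: each key holds one inner list per query text. A small helper can turn it into one row per hit. The sketch below runs on a hand-written dictionary that mimics that shape, so it does not need Chroma installed; `flatten_query_results` is a hypothetical helper and the distances are made up.

```python
def flatten_query_results(results, query_index=0):
    """Turn the column-oriented result dict into one row per hit for one query."""
    return [
        {"id": doc_id, "document": document, "distance": distance}
        for doc_id, document, distance in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["distances"][query_index],
        )
    ]


# Hand-written stand-in with the same shape as a query result; distances are made up.
sample_results = {
    "ids": [["doc2", "doc3"]],
    "documents": [[
        "Machine learning is a subfield of artificial intelligence",
        "Deep learning is a branch of machine learning",
    ]],
    "distances": [[0.21, 0.35]],
}

rows = flatten_query_results(sample_results)
for row in rows:
    print(row["id"], row["distance"], row["document"])
```

With a cosine collection, a smaller distance means a closer match, so the first row is the best hit.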

Milvus (Recommended for Production)

Installation

bash

pip install pymilvus

Python Usage

python

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

# Connect to Milvus
connections.connect(host="localhost", port="19530")

# Define the schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536)
]

schema = CollectionSchema(fields=fields, description="Document collection")

# Create the collection
collection = Collection(name="documents", schema=schema)

# Create the index
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",  # inner product similarity
    "params": {"nlist": 128}  # number of clusters the vectors are partitioned into
}
collection.create_index(field_name="embedding", index_params=index_params)

# Insert data, one list per field; "id" is omitted because auto_id=True
data = [
    ["Introduction to artificial intelligence"],  # content
    [[0.1, 0.2, ...]]  # embedding, truncated here; must contain all 1536 values
]
collection.insert(data)

# Load the collection into memory; Milvus requires this before searching
collection.load()

# Search
search_params = {"metric_type": "IP", "params": {"nprobe": 10}}  # clusters scanned per query
results = collection.search(
    data=[[0.1, 0.2, ...]],  # query vector, truncated in the same way
    anns_field="embedding",
    param=search_params,
    limit=5
)
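The `nlist` and `nprobe` parameters are easier to reason about with a toy inverted-file index. The sketch below is not Milvus code: it assumes hand-picked two-dimensional vectors, fixed centroids instead of ones learned by k-means, and squared L2 distance instead of inner product. It only shows the idea that vectors are grouped into `nlist` buckets and a query scans the `nprobe` nearest buckets.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_l2(vec, centroids[i]))
        buckets[nearest].append(vec_id)
    return buckets


def ivf_search(query, vectors, centroids, buckets, nprobe=1, limit=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)),
                         key=lambda i: squared_l2(query, centroids[i]))
    candidates = [vec_id for i in probe_order[:nprobe] for vec_id in buckets[i]]
    candidates.sort(key=lambda vec_id: squared_l2(query, vectors[vec_id]))
    return candidates[:limit]


# Hand-picked data; the two centroids play the role of nlist = 2.
vectors = {
    "a": [0.1, 0.1], "b": [0.2, 0.0],
    "c": [0.9, 1.0], "d": [1.0, 0.9], "e": [0.6, 0.6],
}
centroids = [[0.0, 0.0], [1.0, 1.0]]
buckets = build_ivf(vectors, centroids)

query = [0.45, 0.45]
fast = ivf_search(query, vectors, centroids, buckets, nprobe=1)
thorough = ivf_search(query, vectors, centroids, buckets, nprobe=2)
print(fast, thorough)
```

With `nprobe=1` the search only looks in the bucket around the first centroid and misses the true nearest neighbour, which sits in the other bucket. Raising `nprobe` trades speed for recall.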

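Embedding Models

Choosing an Embedding Model

Model	Dimensions	Chinese support	Notes
OpenAI text-embedding-3-small	1536	✅	Good price-performance
OpenAI text-embedding-3-large	3072	✅	Stronger of the two OpenAI models, at higher cost
BGE-small-zh	512	✅ Excellent	Open source, developed in China
M3E (m3e-base)	768	✅ Excellent	Open source, developed in China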
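Dimension also drives storage cost. As a rough sketch, assuming each value is stored as a 4-byte float32 and ignoring index structures and metadata, the raw vector size can be estimated like this (`raw_vector_bytes` is a hypothetical helper):

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Bytes needed for the raw vectors alone (no index, no metadata)."""
    return num_vectors * dim * bytes_per_value


models = [
    ("text-embedding-3-small", 1536),
    ("text-embedding-3-large", 3072),
    ("BGE-small-zh", 512),
]

for name, dim in models:
    gib = raw_vector_bytes(1_000_000, dim) / 1024 ** 3
    print(f"{name}: {gib:.2f} GiB per million vectors")
```

Doubling the dimension doubles the raw storage, so a smaller model can be a reasonable choice when the corpus is large and retrieval quality is already sufficient.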

Usage Example

python

from langchain_community.embeddings import HuggingFaceBgeEmbeddings

# Use the BGE Chinese embedding model
embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-zh-v1.5",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)

# Embed a piece of text (the Chinese string means "This is a test text")
vector = embeddings.embed_query("这是一个测试文本")
print(f"Vector dimensions: {len(vector)}")

Node.js Usage

javascript

import { HNSWLib } from '@langchain/community/vectorstores/hnswlib';
import { OpenAIEmbeddings } from '@langchain/openai';

// Initialize the embedding model
const embeddings = new OpenAIEmbeddings({
  model: 'text-embedding-3-small'
});

// Build an in-memory vector store from texts and their metadata
const vectorStore = await HNSWLib.fromTexts(
  ['Artificial intelligence is the trend of the future', 'Machine learning is an important branch of AI'],
  [{ id: 1 }, { id: 2 }],
  embeddings
);

// Persist the index and document store to a directory (example directory name)
await vectorStore.save('hnswlib-store');

// Similarity search
const results = await vectorStore.similaritySearchVectorWithScore(
  await embeddings.embedQuery('What is AI?'),
  2
);

Similarity Computation

Common Distance Metrics

┌─────────────────────────────────────────────────────────────┐
│                     Similarity measures                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  📐 Cosine similarity                                       │
│     cos(θ) = A·B / (|A| × |B|)                              │
│     Range: [-1, 1]; larger means more similar               │
│                                                             │
│  📏 Euclidean distance (L2)                                 │
│     d = √(Σ(Ai-Bi)²)                                        │
│     Smaller means more similar                              │
│                                                             │
│  📊 Inner product                                           │
│     A·B = ΣAi×Bi                                            │
│     Larger means more similar (normalize first)             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
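The three formulas translate directly into a few lines of plain Python. The sketch below is illustrative only; the two vectors are chosen so that one is exactly twice the other, which makes the differences between the metrics easy to see.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Inner product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2_distance(a, b):
    """Euclidean (straight-line) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a, twice as long

cos_ab = cosine_similarity(a, b)
l2_ab = l2_distance(a, b)
ip_ab = dot(a, b)
print(cos_ab, l2_ab, ip_ab)
```

Cosine similarity treats `a` and `b` as a perfect match because it ignores length, while Euclidean distance and the raw inner product both change when a vector is scaled.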

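Choosing a Metric

Scenario	Recommended metric
General text similarity	Cosine similarity
Vector magnitude matters	Inner product (unnormalized) or Euclidean distance
Absolute distance matters	Euclidean distance
High-dimensional vectors	Inner product (on normalized vectors)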
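Pairing the inner product with normalization works because, once vectors are scaled to unit length, the inner product equals the cosine similarity and the squared Euclidean distance becomes `2 - 2 * cosine`. All three metrics then produce the same ranking. A minimal check with made-up vectors:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [4.0, 3.0]

cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

unit_a, unit_b = normalize(a), normalize(b)
inner_product = dot(unit_a, unit_b)
squared_l2 = sum((x - y) ** 2 for x, y in zip(unit_a, unit_b))

print(cosine, inner_product, squared_l2)
```

This is why embedding pipelines that set a normalization option can use an inner-product index and still get cosine-style rankings.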

Vector Database Best Practices

1. A Sensible Chunking Strategy

python

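from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split on paragraph and sentence boundaries, capped by character count
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", "。", "！", "？", " ", ""],  # "" is the last-resort split
    chunk_size=500,        # target chunk size in characters (see length_function)
    chunk_overlap=50,      # overlap between adjacent chunks, preserves context
    length_function=len,   # measure length in characters, not tokens
    is_separator_regex=False
)

# long_document is the full text to split, a str defined elsewhere
chunks = text_splitter.split_text(long_document)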
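To see what `chunk_size` and `chunk_overlap` mean in isolation, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally, since that class prefers to cut at separators; `sliding_window_chunks` is a hypothetical helper that cuts at fixed character offsets.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the continuity that the overlap setting is meant to provide.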

2. Make Use of Metadata

python

# Attach rich metadata (chunk and chunk_id come from the chunking step)
collection.add(
    documents=[chunk],
    ids=[chunk_id],
    metadatas=[{
        "source": "product_manual",      # source
        "category": "usage",              # category
        "page": 5,                         # page number
        "created_at": "2024-01-01",       # creation date
        "author": "Engineering"           # author
    }]
)

# Filter by metadata at query time
results = collection.query(
    query_texts=["How do I use the product?"],
    where={"category": "usage"},  # metadata filter
    n_results=5
)
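Conceptually, an equality filter such as `where={"category": "usage"}` narrows the candidate set before similarity ranking. The sketch below imitates that narrowing step in plain Python under simplifying assumptions: `matches` and `prefilter` are hypothetical helpers, only exact-equality conditions are handled, and the records are made up.

```python
def matches(metadata, where):
    """True when every key in `where` has the same value in `metadata`."""
    return all(metadata.get(key) == value for key, value in where.items())


def prefilter(records, where):
    """Keep only the records whose metadata satisfies the filter."""
    return [record for record in records if matches(record["metadata"], where)]


records = [
    {"id": "chunk-1", "metadata": {"category": "usage", "page": 5}},
    {"id": "chunk-2", "metadata": {"category": "faq", "page": 9}},
    {"id": "chunk-3", "metadata": {"category": "usage", "page": 12}},
]

candidates = prefilter(records, {"category": "usage"})
print([record["id"] for record in candidates])
```

Only the surviving candidates would then be ranked by vector similarity, which keeps unrelated chunks out of the results and reduces the work per query.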

3. Index Tuning

python

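# Milvus index configuration
index_params = {
    "index_type": "HNSW",  # graph index: high recall and fast queries, at the cost of more memory
    "metric_type": "COSINE",
    "params": {
        "M": 16,               # max connections per graph node; higher improves recall, uses more memory
        "efConstruction": 200  # candidate list size during build; higher improves quality, slows builds
    }
}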
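An HNSW index also has a query-time knob. As a hedged sketch, the search parameters that would accompany the index above might look like the following; the value 64 is an arbitrary starting point chosen for illustration, not a recommendation from the original text.

```python
limit = 5  # number of results requested per query

# Search-time parameters to pair with an HNSW index built with metric_type "COSINE".
search_params = {
    "metric_type": "COSINE",   # must match the metric used when the index was built
    "params": {"ef": 64},      # candidate list size per query; should not be below limit
}

print(search_params)
```

A larger `ef` explores more of the graph and tends to raise recall at the cost of latency, so it is usually tuned against a sample of real queries.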

4. Hybrid Search

python

async def hybrid_search(query, top_k=5):
    # vector_search and keyword_search are assumed to be defined elsewhere and
    # to return (doc, score) pairs ordered from best to worst.

    # 1. Semantic search
    semantic_results = await vector_search(query, top_k * 2)

    # 2. Keyword search
    keyword_results = await keyword_search(query, top_k * 2)

    # 3. RRF (Reciprocal Rank Fusion): each list contributes 1 / (k + rank),
    #    with k = 60 and rank starting at 1. Raw scores are ignored because
    #    the two searches score on different scales.
    fused_scores = {}
    for results in (semantic_results, keyword_results):
        for rank, (doc, _score) in enumerate(results, start=1):
            fused_scores[doc.id] = fused_scores.get(doc.id, 0) + 1 / (60 + rank)

    # 4. Return the fused (doc_id, score) pairs, best first
    sorted_results = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:top_k]
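The fusion step can be tried on its own without any search backend. The sketch below is a synchronous, self-contained version that takes ranked lists of document ids; `reciprocal_rank_fusion` is a hypothetical helper and the ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores the sum of 1 / (k + rank) over the lists that
    contain it, with rank starting at 1.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_k]


semantic = ["d1", "d2", "d3"]   # best-first ids from a vector search (made up)
keyword = ["d2", "d4", "d1"]    # best-first ids from a keyword search (made up)

fused = reciprocal_rank_fusion([semantic, keyword], top_k=3)
print(fused)
```

`d2` comes out on top because it ranks highly in both lists, even though it is not first in the semantic list. Documents found by only one search still appear, but lower down.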
