MiniSearch

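MiniSearch is a tiny but powerful in-memory full-text search engine written in JavaScript. It is light on resources and runs comfortably in both Node.js and the browser.

Try out the demo application.

Complete documentation and an API reference are available. For more background on MiniSearch, including a comparison with similar libraries, see the accompanying blog post.

MiniSearch follows semantic versioning; releases and changes are recorded in the changelog.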

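Use case

MiniSearch addresses use cases that need full-text search features (prefix search, fuzzy search, ranking, field boosting, and so on) where the data to be indexed fits in local process memory. You cannot index the whole Internet with it, but a surprising number of use cases are served well by MiniSearch. Because the index is held in local memory, MiniSearch works offline and processes queries quickly, with no network latency.

A prominent use case is real-time search-as-you-type in web and mobile applications, where keeping the index on the client enables a fast, responsive UI without requests to a search server.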

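Features

Memory-efficient index, designed to support memory-constrained use cases such as mobile browsers.
Exact match, prefix search, fuzzy match, field boosting.
Auto-suggestion engine, for auto-completion of search queries.
Modern search result ranking algorithm.
Documents can be added to and removed from the index at any time.
Zero external dependencies.

MiniSearch strives to expose a simple API that provides the building blocks for custom solutions, while keeping a small and well-tested codebase.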

Installation

With npm:

npm install minisearch

With yarn:

yarn add minisearch

Then require or import it in your project:

// If you are using import:
import MiniSearch from 'minisearch'

// If you are using require:
const MiniSearch = require('minisearch')

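Alternatively, if you prefer to use a <script> tag, you can load MiniSearch from a CDN:

<script src="https://cdn.jsdelivr.net/npm/minisearch@7.2.0/dist/umd/index.min.js"></script>

In this case, MiniSearch is exposed as a global variable in your project.

Finally, to build the library manually, clone the repository and run yarn build (or yarn build-minified for a minified version plus source maps). The compiled source is written to the dist folder (UMD, ES6 and ES2015 module versions are provided).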

Usage

Basic usage

// A collection of documents for the examples
const documents = [
  {
    id: 1,
    title: 'Moby Dick',
    text: 'Call me Ishmael. Some years ago...',
    category: 'fiction'
  },
  {
    id: 2,
    title: 'Zen and the Art of Motorcycle Maintenance',
    text: 'I can see by my watch...',
    category: 'fiction'
  },
  {
    id: 3,
    title: 'Neuromancer',
    text: 'The sky above the port was...',
    category: 'fiction'
  },
  {
    id: 4,
    title: 'Zen and the Art of Archery',
    text: 'At first sight it must seem...',
    category: 'non-fiction'
  },
  // ... more documents
]

let miniSearch = new MiniSearch({
  fields: ['title', 'text'], // fields to index for full-text search
  storeFields: ['title', 'category'] // fields to return with search results
})

// Index all documents
miniSearch.addAll(documents)

// Search with default options
let results = miniSearch.search('zen art motorcycle')
// => [
//   { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', category: 'fiction', score: 2.77258, match: { ... } },
//   { id: 4, title: 'Zen and the Art of Archery', category: 'non-fiction', score: 1.38629, match: { ... } }
// ]

Search options

MiniSearch supports several options for more advanced search behavior:

// Search only specific fields
miniSearch.search('zen', { fields: ['title'] })

// Boost some fields (here "title")
miniSearch.search('zen', { boost: { title: 2 } })

// Prefix search (so that 'moto' will match 'motorcycle')
miniSearch.search('moto', { prefix: true })

// Search within a specific category
miniSearch.search('zen', {
  filter: (result) => result.category === 'fiction'
})

// Fuzzy search: in this example the maximum edit distance is 0.2 * term length,
// rounded to the nearest integer. The misspelled 'ismael' will match 'ishmael'.
miniSearch.search('ismael', { fuzzy: 0.2 })

// You can set default search options upon initialization
miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  searchOptions: {
    boost: { title: 2 },
    fuzzy: 0.2
  }
})
miniSearch.addAll(documents)

// Fuzzy search and boosting of "title" are now applied by default:
miniSearch.search('zen and motorcycles')
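
The fractional fuzzy value is described above as a share of the term length. As a rough illustration of that arithmetic, the sketch below computes the allowed distance for 'ismael' and checks that 'ishmael' falls within it. It follows the description in the comment above and assumes plain Levenshtein distance; it is not MiniSearch's internal code, and maxEditDistance and levenshtein are hypothetical helper names.

```javascript
// Hypothetical helper: maximum edit distance for a fractional fuzzy value,
// following the description "fuzzy * term length, rounded to the nearest integer".
function maxEditDistance(term, fuzzy) {
  return Math.round(fuzzy * term.length)
}

// Plain Levenshtein distance (insertions, deletions, substitutions),
// computed row by row to keep memory use small.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
    }
    prev = curr
  }
  return prev[b.length]
}

const allowed = maxEditDistance('ismael', 0.2) // Math.round(1.2) = 1
const distance = levenshtein('ismael', 'ishmael') // one inserted 'h' = 1
console.log(distance <= allowed) // true
```

With a fuzzy value of 0.2, a six-letter term tolerates one edit, which is why the single missing letter in 'ismael' still matches.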

Auto suggestions

MiniSearch can suggest search queries given an incomplete query:

miniSearch.autoSuggest('zen ar')
// => [ { suggestion: 'zen archery art', terms: [ 'zen', 'archery', 'art' ], score: 1.73332 },
//      { suggestion: 'zen art', terms: [ 'zen', 'art' ], score: 1.21313 } ]

The autoSuggest method takes the same options as the search method, so you can use fuzzy search to get suggestions for misspelled words:

miniSearch.autoSuggest('neromancer', { fuzzy: 0.2 })
// => [ { suggestion: 'neuromancer', terms: [ 'neuromancer' ], score: 1.03998 } ]

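Suggestions are ranked by the relevance of the documents that the suggested search would return.

Sometimes you may need to filter auto suggestions, for example to a specific category only. You can do so by providing a filter option: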

miniSearch.autoSuggest('zen ar', {
  filter: (result) => result.category === 'fiction'
})
// => [ { suggestion: 'zen art', terms: [ 'zen', 'art' ], score: 1.21313 } ]
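
To show how the suggestion objects might feed an auto-complete dropdown, here is a minimal sketch that works on the result shape shown above (suggestion, terms, score). The sample data is copied from the unfiltered 'zen ar' example; topSuggestions is a hypothetical helper, not part of the MiniSearch API.

```javascript
// Sample data copied from the unfiltered autoSuggest('zen ar') output above.
const suggestions = [
  { suggestion: 'zen archery art', terms: ['zen', 'archery', 'art'], score: 1.73332 },
  { suggestion: 'zen art', terms: ['zen', 'art'], score: 1.21313 }
]

// Hypothetical helper: keep at most `limit` suggestion strings, best score first.
// The input array is copied before sorting so the caller's data is not mutated.
function topSuggestions(items, limit) {
  return [...items]
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((item) => item.suggestion)
}

console.log(topSuggestions(suggestions, 1)) // [ 'zen archery art' ]
```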

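Field extraction

By default, documents are assumed to be plain key-value objects, with field names as keys and field values as simple values. To support custom field extraction logic (for example for nested fields, or for non-string field values that need processing before tokenization), a custom field extractor function can be passed as the extractField option:

// Assuming that our documents look like this: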
const documents = [
  { id: 1, title: 'Moby Dick', author: { name: 'Herman Melville' }, pubDate: new Date(1851, 9, 18) },
  { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', author: { name: 'Robert Pirsig' }, pubDate: new Date(1974, 3, 1) },
  { id: 3, title: 'Neuromancer', author: { name: 'William Gibson' }, pubDate: new Date(1984, 6, 1) },
  { id: 4, title: 'Zen in the Art of Archery', author: { name: 'Eugen Herrigel' }, pubDate: new Date(1948, 0, 1) },
  // ... more documents
]

// We can support nested fields (author.name) and date fields (pubDate)
// with a custom `extractField` function:

let miniSearch = new MiniSearch({
  fields: ['title', 'author.name', 'pubYear'],
  extractField: (document, fieldName) => {
    // If the field name is 'pubYear', extract just the year from 'pubDate'
    if (fieldName === 'pubYear') {
      const pubDate = document['pubDate']
      return pubDate && pubDate.getFullYear().toString()
    }

    // Access nested fields
    return fieldName.split('.').reduce((doc, key) => doc && doc[key], document)
  }
})

The default field extractor can be obtained by calling MiniSearch.getDefault('extractField').
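
To see what such an extractor yields for a single document, the same logic can be pulled out into a standalone function and called directly. This is only a sketch for inspection; MiniSearch itself calls the function during indexing, and the variable names here are illustrative.

```javascript
// Same extraction logic as the example above, as a standalone function.
const extractField = (document, fieldName) => {
  // Derive the virtual 'pubYear' field from the 'pubDate' Date object
  if (fieldName === 'pubYear') {
    const pubDate = document['pubDate']
    return pubDate && pubDate.getFullYear().toString()
  }
  // Walk dotted paths such as 'author.name', stopping at missing keys
  return fieldName.split('.').reduce((doc, key) => doc && doc[key], document)
}

const doc = {
  id: 1,
  title: 'Moby Dick',
  author: { name: 'Herman Melville' },
  pubDate: new Date(1851, 9, 18)
}

console.log(extractField(doc, 'title'))       // 'Moby Dick'
console.log(extractField(doc, 'author.name')) // 'Herman Melville'
console.log(extractField(doc, 'pubYear'))     // '1851'
console.log(extractField({ id: 5 }, 'author.name')) // undefined
```

Returning undefined for a missing nested path, rather than throwing, keeps documents with incomplete data indexable.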

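Tokenization

By default, documents are tokenized by splitting on Unicode space or punctuation characters. The tokenization logic can be changed by passing a custom tokenizer function as the tokenize option:

// Tokenize splitting by hyphen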
let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  tokenize: (string, _fieldName) => string.split('-')
})

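At search time, the same tokenization is used by default. If a different search-time tokenization is needed, pass the tokenize search option:

// Tokenize splitting by hyphen
let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  tokenize: (string) => string.split('-'), // indexing tokenizer
  searchOptions: {
    tokenize: (string) => string.split(/[\s-]+/) // search query tokenizer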
  }
})

The default tokenizer can be obtained by calling MiniSearch.getDefault('tokenize').
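
Because tokenizers are plain functions from a string to an array of tokens, they can be tried out in isolation before being handed to MiniSearch. The sketch below runs the two tokenizers from the example above on sample strings. approxDefaultTokenize is an assumption: an approximation of "split on Unicode space or punctuation" written with Unicode property escapes, not code taken from the library.

```javascript
// Index-time tokenizer from the example above: split on hyphens only.
const indexTokenize = (string) => string.split('-')

// Search-time tokenizer from the example above: split on whitespace or hyphens.
const searchTokenize = (string) => string.split(/[\s-]+/)

// Assumption: approximates "Unicode space or punctuation" splitting.
// Empty tokens (for example after trailing punctuation) are dropped.
const approxDefaultTokenize = (string) =>
  string.split(/[\s\p{Z}\p{P}]+/u).filter((token) => token.length > 0)

console.log(indexTokenize('state-of-the-art'))
// [ 'state', 'of', 'the', 'art' ]
console.log(searchTokenize('state of-the art'))
// [ 'state', 'of', 'the', 'art' ]
console.log(approxDefaultTokenize('Call me Ishmael. Some years ago...'))
// [ 'Call', 'me', 'Ishmael', 'Some', 'years', 'ago' ]
```

The search-time tokenizer is more lenient on purpose: users typing a query may separate words with spaces where the indexed text used hyphens.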

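Term processing

By default, terms are downcased. No stemming is performed, and no stop-word list is applied. To customize how terms are processed at indexing time, for example to normalize them, filter them, or apply stemming, use the processTerm option. The processTerm function should return the processed term as a string, or a falsy value if the term should be discarded:

let stopWords = new Set(['and', 'or', 'to', 'in', 'a', 'the', /* ...and more */ ])

// Perform custom term processing (here downcasing and discarding stop words).
// The term is downcased before the lookup so that 'The' is also discarded.
let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  processTerm: (term, _fieldName) => {
    const lower = term.toLowerCase()
    return stopWords.has(lower) ? null : lower
  }
})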

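By default, the same processing is applied to search queries. To apply different processing to search queries, supply a processTerm search option:

let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  processTerm: (term) => { // index term processing
    const lower = term.toLowerCase()
    return stopWords.has(lower) ? null : lower
  },
  searchOptions: {
    processTerm: (term) => term.toLowerCase() // search query processing
  }
})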

The default term processor can be obtained by calling MiniSearch.getDefault('processTerm').
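
Normalization is mentioned above as one use of processTerm. As a minimal sketch of what that could look like, the function below downcases a term, strips combining diacritical marks so that 'Café' and 'cafe' index to the same term, and discards stop words. The function and its stop-word list are illustrative assumptions, not MiniSearch defaults.

```javascript
// Illustrative stop-word list (an assumption, not a library default).
const stopWords = new Set(['and', 'or', 'to', 'in', 'a', 'the'])

// Hypothetical term processor: downcase, decompose accented characters (NFD),
// strip the combining marks, then discard stop words by returning null.
const processTerm = (term) => {
  const normalized = term
    .toLowerCase()
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
  return stopWords.has(normalized) ? null : normalized
}

console.log(processTerm('Café')) // 'cafe'
console.log(processTerm('The'))  // null
console.log(processTerm('Zen'))  // 'zen'
```

A function like this could be passed as the processTerm option. Since the same processing applies to search queries by default, a query for 'cafe' would then also match documents containing 'Café'.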

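API Documentation

See the API documentation for details about configuration options and methods.

Browser and Node.js compatibility

MiniSearch supports all browsers and Node.js versions that implement the ES9 (ES2018) JavaScript standard. That includes all modern browsers and Node.js versions.

ES6 (ES2015) compatibility can be achieved by transpiling the tokenizer RegExp to expand Unicode character class escapes, for example with https://babeljs.io/docs/babel-plugin-transform-unicode-sets-regex

Contributing

Contributions to MiniSearch are welcome. Please read the contributing guidelines. Reading the design document is also useful for understanding the project goals and the technical implementation.

Project page: https://github.com/lucaong/minisearch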
