タグ: Solr

[Solr] Cross Collection Join

はじめに

Solr 8.6 以降ではコレクションをまたがった Join クエリを実行することができます。

Cross Collection Join

メインの検索対象を collection1、Join 対象を collection2 とします。Cross Collection Join は、collection2 に対する検索条件にヒットするデータの join key を持つ collection1 側のデータを取得する一種のフィルタクエリとして機能します。注意が必要なのは collection2 側のデータのフィールドは取得できないということです。collection2 は別のサーバで動く Solr 上にあっても構いません。

動作確認

Cross Collection Join の動作を確認するため、2つの Solr プロセスを起動します。

bin/solr start -cloud -p 8983
bin/solr start -cloud -p 7574

8983 の方は solrconfig.xml に以下を記載します。

<queryParser name="join" class="org.apache.solr.search.JoinQParserPlugin">
    <arr name="allowSolrUrls">
      <str>http://localhost:7574/solr</str>
    </arr>
</queryParser>

8983 の collection1 には以下のようなデータが入っています。

$ curl -s 'http://localhost:8983/solr/collection1/select?q.op=OR&q=*:*&rows=5'
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "q.op":"OR",
      "rows":"5",
      "_":"1625974694981"}},
  "response":{"numFound":9238,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"10",
        "type_str":"官公庁",
        "name_str":"軽自動車検査協会大阪主管事務所",
        "_version_":1704977365000519680},
      {
        "id":"11",
        "type_str":"官公庁",
        "name_str":"大阪陸運支局なにわ自動車検査登録事務所",
        "_version_":1704977365035122688},
      {
        "id":"12",
        "type_str":"官公庁",
        "name_str":"住吉税務署",
        "_version_":1704977365036171264},
      {
        "id":"13",
        "type_str":"官公庁",
        "name_str":"玉出年金事務所",
        "_version_":1704977365036171265},
      {
        "id":"14",
        "type_str":"官公庁",
        "name_str":"大阪南労働基準監督署",
        "_version_":1704977365037219840}]
  }}

7574 の collection2 には以下のようなデータが入っています。

$ curl -s 'http://localhost:7574/solr/collection2/select?q.op=OR&q=*:*&rows=5'
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "q.op":"OR",
      "rows":"5",
      "_":"1625993148833"}},
  "response":{"numFound":9238,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"10",
        "area_str":"住之江区",
        "address_str":"住之江区南港東3-4-62",
        "address_p":"34.6164938333333,135.438210722222",
        "_version_":1704977387371888640},
      {
        "id":"11",
        "area_str":"住之江区",
        "address_str":"住之江区南港東3-1-14",
        "address_p":"34.6190439722222,135.442191833333",
        "_version_":1704977387419074560},
      {
        "id":"12",
        "area_str":"住吉区",
        "address_str":"住吉区住吉2丁目17番37号",
        "address_p":"34.6109641111111,135.491388722222",
        "_version_":1704977387420123136},
      {
        "id":"13",
        "area_str":"住之江区",
        "address_str":"住之江区北加賀屋2-3-6",
        "address_p":"34.6231918888888,135.477992138888",
        "_version_":1704977387420123137},
      {
        "id":"14",
        "area_str":"西成区",
        "address_str":"西成区玉出中2丁目13番27号",
        "address_p":"34.6237640555555,135.4924",
        "_version_":1704977387421171712}]
  }}

中央区にある「文化・観光」の施設を検索してみます。施設タイプは collection1、エリア情報は collection2 にあるので Cross Collection Join の出番です。

type_str:文化・観光 {!join method="crossCollection" fromIndex="collection2" from="id" to="id" solrUrl="http://localhost:7574/solr" v="area_str:中央区"}

solrUrl で指定した Solr のコレクション collection2 で area_str:中央区を検索して collection1 で type_str:文化・観光を検索した結果と JOIN する、join key はそれぞれの id フィールド、というクエリです。

$ curl -s 'http://localhost:8983/solr/collection1/select?q=type_str%3A%E6%96%87%E5%8C%96%E3%83%BB%E8%A6%B3%E5%85%89%20%7B!join%20method%3D%22crossCollection%22%20fromIndex%3D%22collection2%22%20from%3D%22id%22%20to%3D%22id%22%20solrUrl%3D%22http%3A%2F%2Flocalhost%3A7574%2Fsolr%22%20v%3D%22area_str%3A%E4%B8%AD%E5%A4%AE%E5%8C%BA%22%7D'
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":0,
    "params":{
      "q":"type_str:文化・観光 {!join method=\"crossCollection\" fromIndex=\"collection2\" from=\"id\" to=\"id\" solrUrl=\"http://localhost:7574/solr\" v=\"area_str:中央区\"}",
      "q.op":"AND",
      "debugQuery":"false",
      "_":"1625975185887"}},
  "response":{"numFound":56,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"3101",
        "type_str":"文化・観光",
        "name_str":"大阪市立島之内図書館",
        "_version_":1704977365352841232},
      {
        "id":"3136",
        "type_str":"文化・観光",
        "name_str":"上方浮世絵館",
        "_version_":1704977365354938370},
      {
        "id":"3137",
        "type_str":"文化・観光",
        "name_str":"府立上方演芸資料館（ワッハ上方）",
        "_version_":1704977365354938371},
      {
        "id":"3138",
        "type_str":"文化・観光",
        "name_str":"国立文楽劇場",
        "_version_":1704977365354938372},
      {
        "id":"3140",
        "type_str":"文化・観光",
        "name_str":"湯木美術館",
        "_version_":1704977365354938374},
      {
        "id":"3143",
        "type_str":"文化・観光",
        "name_str":"適塾",
        "_version_":1704977365354938377},
      {
        "id":"3144",
        "type_str":"文化・観光",
        "name_str":"府立現代美術センター",
        "_version_":1704977365354938378},
      {
        "id":"3145",
        "type_str":"文化・観光",
        "name_str":"大阪歴史博物館",
        "_version_":1704977365354938379},
      {
        "id":"3146",
        "type_str":"文化・観光",
        "name_str":"ピースおおさか：大阪国際平和センター",
        "_version_":1704977365354938380},
      {
        "id":"3147",
        "type_str":"文化・観光",
        "name_str":"大阪城天守閣",
        "_version_":1704977365354938381}]
  }}

[Solr]コレクションのインクリメンタルバックアップ

はじめに

Solr 8.9から SolrCloud におけるコレクションのバックアップの形式が変更されてインクリメンタルバックアップ対応となりました。

バックアップ

インクリメンタルバックアップ対応に伴い、コレクションAPIのBACKUPアクションのパラメータに以下の2つが追加されました。

incremental

true (デフォルト)ならインクリメンタルバックアップ、false なら従来と同様のスナップショットによるバックアップとなります。スナップショットによるバックアップは近いうちに廃止となる予定です。

maxNumBackupPoints

インクリメンタルバックアップではバックアップポイントという形で直前のバックアップとの差分が保存されていきます。
最大で何個をバックアップポイントを保存するのかを maxNumBackupPoints で指定します。

サンプル

サンプルとして大阪の施設情報を利用します。

以下のようなデータを1件ずつ追加して、その都度バックアップします。

os_1.json
[{"id":"10","type":"官公庁","area":"住之江区","name":"軽自動車検査協会大阪主管事務所","address":"住之江区南港東3-4-62","address_p":"34.6164938333333,135.438210722222"}]

os_2.json
[{"id":"11","type":"官公庁","area":"住之江区","name":"大阪陸運支局なにわ自動車検査登録事務所","address":"住之江区南港東3-1-14","address_p":"34.6190439722222,135.442191833333"}]

os_3.json
[{"id":"12","type":"官公庁","area":"住吉区","name":"住吉税務署","address":"住吉区住吉2丁目17番37号","address_p":"34.6109641111111,135.491388722222"}]

os_4.json
[{"id":"13","type":"官公庁","area":"住之江区","name":"玉出年金事務所","address":"住之江区北加賀屋2-3-6","address_p":"34.6231918888888,135.477992138888"}]

$ curl 'http://localhost:8983/solr/backup_test/update?commit=true&indent=true' --data-binary @os_1.json -H 'Content-Type: application/json'

$ curl 'http://localhost:8983/solr/admin/collections?omitHeader=true&action=BACKUP&name=backup1&collection=backup_test&location=/tmp/solr_backup&incremental=true&maxNumBackupPoints=3'
{
  "success":{
    "127.0.1.1:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":2},
      "response":[
        "startTime","2021-06-28T14:00:11.906389Z",
        "indexFileCount",18,
        "uploadedIndexFileCount",18,
        "indexSizeMB",0.005,
        "uploadedIndexFileMB",0.005,
        "shard","shard1",
        "endTime","2021-06-28T14:00:11.908343Z",
        "shardBackupId","md_shard1_1"]}},
  "response":[
    "collection","backup_test",
    "numShards",1,
    "backupId",1,
    "indexVersion","8.9.0",
    "startTime","2021-06-28T14:00:11.903010Z",
    "indexSizeMB",0.005]}

$ curl 'http://localhost:8983/solr/backup_test/update?commit=true&indent=true' --data-binary @os_2.json -H 'Content-Type: application/json'

$ curl 'http://localhost:8983/solr/admin/collections?omitHeader=true&action=BACKUP&name=backup1&collection=backup_test&location=/tmp/solr_backup&incremental=true&maxNumBackupPoints=3'
{
  "success":{
    "127.0.1.1:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":2},
      "response":[
        "startTime","2021-06-28T14:00:36.142084Z",
        "indexFileCount",35,
        "uploadedIndexFileCount",18,
        "indexSizeMB",0.01,
        "uploadedIndexFileMB",0.005,
        "shard","shard1",
        "endTime","2021-06-28T14:00:36.144073Z",
        "shardBackupId","md_shard1_2"]}},
  "response":[
    "collection","backup_test",
    "numShards",1,
    "backupId",2,
    "indexVersion","8.9.0",
    "startTime","2021-06-28T14:00:36.137745Z",
    "indexSizeMB",0.01]}

$ curl 'http://localhost:8983/solr/backup_test/update?commit=true&indent=true' --data-binary @os_3.json -H 'Content-Type: application/json'

$ curl 'http://localhost:8983/solr/admin/collections?omitHeader=true&action=BACKUP&name=backup1&collection=backup_test&location=/tmp/solr_backup&incremental=true&maxNumBackupPoints=3'
{
  "success":{
    "127.0.1.1:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":4},
      "response":[
        "startTime","2021-06-28T14:01:38.399993Z",
        "indexFileCount",52,
        "uploadedIndexFileCount",18,
        "indexSizeMB",0.015,
        "uploadedIndexFileMB",0.005,
        "shard","shard1",
        "endTime","2021-06-28T14:01:38.404263Z",
        "shardBackupId","md_shard1_3"]}},
  "response":[
    "collection","backup_test",
    "numShards",1,
    "backupId",3,
    "indexVersion","8.9.0",
    "startTime","2021-06-28T14:01:38.398621Z",
    "indexSizeMB",0.015],
  "deleted":[[
      "startTime","2021-06-28T13:56:09.437719Z",
      "backupId",0,
      "size",5223,
      "numFiles",18]],
  "collection":"backup_test"}

$ curl 'http://localhost:8983/solr/backup_test/update?commit=true&indent=true' --data-binary @os_4.json -H 'Content-Type: application/json'

$ curl 'http://localhost:8983/solr/admin/collections?omitHeader=true&action=BACKUP&name=backup1&collection=backup_test&location=/tmp/solr_backup&incremental=true&maxNumBackupPoints=3'
{
  "success":{
    "127.0.1.1:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":3},
      "response":[
        "startTime","2021-06-28T14:02:05.875017Z",
        "indexFileCount",69,
        "uploadedIndexFileCount",18,
        "indexSizeMB",0.019,
        "uploadedIndexFileMB",0.005,
        "shard","shard1",
        "endTime","2021-06-28T14:02:05.878413Z",
        "shardBackupId","md_shard1_4"]}},
  "response":[
    "collection","backup_test",
    "numShards",1,
    "backupId",4,
    "indexVersion","8.9.0",
    "startTime","2021-06-28T14:02:05.873449Z",
    "indexSizeMB",0.019],
  "deleted":[[
      "startTime","2021-06-28T14:00:11.903010Z",
      "backupId",1,
      "size",5223,
      "numFiles",18]],
  "collection":"backup_test"}

各バックアップの応答に backupId が記載されています。リストアの際に backupId を指定することで特定のバックアップポイントに戻すことができます。

4回目のバックアップの応答に deleted という項目があります。maxNumBackupPoints=3 を指定しているので、4回目のバックアップでは最初のバックアップが削除されてバックアップポイントが3個に保たれていることが分かります。

ここでインデックスの状態を確認します。ドキュメントを1件ずつ4回投入したので、全件検索すると4件ヒットします。

$ curl -s 'http://localhost:8983/solr/backup_test/select?omitHeader=true&q.op=OR&q=*%3A*&rows=0'
{
  "response":{"numFound":4,"start":0,"numFoundExact":true,"docs":[]
  }}

リストア

LISTBACKUP アクションで現在利用可能なバックアップポイントを確認できます。

$ curl 'http://localhost:8983/solr/admin/collections?omitHeader=true&action=LISTBACKUP&name=backup1&location=/tmp/solr_backup'
{
  "collection":"backup_test",
  "backups":[{
      "indexFileCount":0,
      "indexSizeMB":0.0,
      "shardBackupIds":{"shard1":"md_shard1_2.json"},
      "collection.configName":"backup_test",
      "backupId":2,
      "collectionAlias":"backup_test",
      "startTime":"2021-06-28T14:00:36.137745Z",
      "indexVersion":"8.9.0"},
    {
      "indexFileCount":0,
      "indexSizeMB":0.0,
      "shardBackupIds":{"shard1":"md_shard1_3.json"},
      "collection.configName":"backup_test",
      "backupId":3,
      "collectionAlias":"backup_test",
      "startTime":"2021-06-28T14:01:38.398621Z",
      "indexVersion":"8.9.0"},
    {
      "indexFileCount":0,
      "indexSizeMB":0.0,
      "shardBackupIds":{"shard1":"md_shard1_4.json"},
      "collection.configName":"backup_test",
      "backupId":4,
      "collectionAlias":"backup_test",
      "startTime":"2021-06-28T14:02:05.873449Z",
      "indexVersion":"8.9.0"}]}

ドキュメント2件のときのバックアップ(backupId=2)に戻します。

$ curl 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=backup1&collection=backup_test&location=/tmp/solr_backup&backupId=2'
{
  "responseHeader":{
    "status":0,
    "QTime":599}}
tfukui@deskmini:~/dev/splout_blog/sample_data$ curl -s 'http://localhost:8983/solr/backup_test/select?omitHeader=true&q.op=OR&q=*%3A*&rows=0'
{
  "response":{"numFound":2,"start":0,"numFoundExact":true,"docs":[]
  }}

ドキュメント3件のときのバックアップ(backupId=3)に戻します。

$ curl 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=backup1&collection=backup_test&location=/tmp/solr_backup&backupId=3'
{
  "responseHeader":{
    "status":0,
    "QTime":624}}
tfukui@deskmini:~/dev/splout_blog/sample_data$ curl -s 'http://localhost:8983/solr/backup_test/select?omitHeader=true&q.op=OR&q=*%3A*&rows=0'
{
  "response":{"numFound":3,"start":0,"numFoundExact":true,"docs":[]
  }}

既存のコレクションに対するリストアができるようになったことと併せて、コレクションのバックアップ機能がより使いやすくなりました。

[Solr]既存のコレクションに対してバックアップをリストアできるようになりました

はじめに

Solr 8.8 までは、SolrCloud で取得したバックアップをリストアするときには新しいコレクションを指定しなければならないという制限がありました。したがって、クライアントに公開するコレクション名をエイリアスで運用して、バックアップのリストア後にエイリアスを切り替えるという工夫が必要でした。

Solr 8.9 からは既存のコレクションに対してリストアできるようになりました。

バックアップ

curl 'http://localhost:8983/solr/admin/collections?action=BACKUP&name=backup1&collection=backup_test&location=/tmp/solr_backup'

backup_test というコレクションのバックアップを /tmp/solr_backup 以下に作成します。リストア時に参照するときの名前は backup1 です。

リストア(8.8 までの場合)

既存のコレクション名 backup_test を指定するとパラメータエラーとなります。

$ curl 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=backup1&collection=backup_test&location=/tmp/solr_backup'
{
  "responseHeader":{
    "status":400,
    "QTime":0},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"Collection 'backup_test' exists, no action taken.",
    "code":400}}

新しいコレクション名 new_backup_test を指定すると、そのコレクションが作られてからリストアされます。

$ curl 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=backup1&collection=new_backup_test&location=/tmp/solr_backup'
{
  "responseHeader":{
    "status":0,
    "QTime":612},
  "success":{
    "127.0.1.1:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":238},
      "core":"new_backup_test_shard1_replica_n1"}}}

リストア(8.9の場合)

既存のコレクション名 backup_test を指定してリストアできます。

試しに、バックアップ後にインデックスの内容を全部消去します。

$ curl -s 'http://localhost:8983/solr/backup_test/select?omitHeader=true&q.op=OR&q=*%3A*&rows=0'
{
  "response":{"numFound":9238,"start":0,"numFoundExact":true,"docs":[]
  }}

$ curl -s 'http://localhost:8983/solr/backup_test/update?stream.body=*:*&wt=json&commit=true'
{
  "responseHeader":{
    "rf":1,
    "status":0,
    "QTime":2}}

$ curl -s 'http://localhost:8983/solr/backup_test/select?omitHeader=true&q.op=OR&q=*%3A*&rows=0'
{
  "response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]
  }}

その後、既存のコレクション backup_test を対象にしてリストアします。8.8までとは違ってエラーになりません。

$ curl 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=backup1&collection=backup_test&location=/tmp/solr_backup'
{
  "responseHeader":{
    "status":0,
    "QTime":977}}

全件検索したときの件数が元に戻りました。

$ curl -s 'http://localhost:8983/solr/backup_test/select?omitHeader=true&q.op=OR&q=*%3A*&rows=0'
{
  "response":{"numFound":9238,"start":0,"numFoundExact":true,"docs":[]
  }}

既存のコレクションに対するリストアの仕組み

SOLR-1587 によると、使用中である可能性もある既存のコレクションに対するリストアを実現するために、Solr 8.1 で追加された Read-Only モードが使われているそうです。

指定されたコレクションを Read-Only モードにセットする
コレクションの各シャードにバックアップをリストアする
コレクションの Read-Only モードを解除する

[Solr]TextProfileSignatureによるDe-Duplication

はじめに

前回の記事で取り上げた De-Duplication ではハッシュの計算方法として、厳密には一致しなくてもほぼ同内容のドキュメントを同一として扱うためのTextProfileSignature が利用できます。Solr のドキュメントでは以下のように書かれています。

Fuzzy hashing implementation from Apache Nutch for near duplicate detection. It’s tunable but works best on longer text.
https://solr.apache.org/guide/8_8/de-duplication.html

どのくらい Fuzzy でも大丈夫なのか興味があったので調べてみました。

TextProfileSignature クラス

TextProfileSignature クラスの JavaDoc に詳しい説明がありました。

文字と数字以外を取り除いて小文字に統一する
ソースを見ると、この判定には Character.isLetterOrDigit() が使われています。
空白区切りでトークンに分割する
MIN_TOKEN_LEN(デフォルト2)より短いトークンを捨てる
各トークンの出現回数をカウントする
足きり用の QUANT を計算する。QUANT = QUANT_RATE * 最頻出のトークンの出現回数 (QUANT_RATEのデフォルト0.01)
QUANT が2より小さい場合は QUANT = 2 とする。ただし、2回以上出現したトークンが存在しない場合は QUANT = 1 とする。
- すべてのトークンが1回ずつしか出現しなかった場合は足きりせず全部使うということ
- ソースを見ると QUANT_RATE * 再頻出のトークンの出現回数を四捨五入している。つまり、QUANT_RATE がデフォルトの 0.01 であれば、再頻出のトークンの出現回数が250までは QUANT = 2 (1回しか出現しないトークンは捨てられる)となる。
QUANT よりも小さい出現回数のトークンを捨てる
残ったトークンを出現回数順に並べて MD5 ハッシュを計算する

ちなみに、空白文字で区切ってトークンを作るという処理なので、日本語のドキュメントにはあまり有効ではなさそうで、日本語ドキュメントで曖昧な De-Deplication をするためには、Tokenizer と連携する ProfileSignature を実装する必要がありそうです。

実験

実験のため、TextProfileSignature を呼び出す簡単なプログラムを作りました。

短いドキュメントでも効果がわかりやすいように、QUANT_RATE は 1 としています。これなら、再頻出のトークンの出現回数が2ならQUANTは2、再頻出のトークンの出現回数が2ならQUANTは3となります。

'I have an apple'  8b821c9e763bb2fc567d473996cfde4a
'I have an apple.' 8b821c9e763bb2fc567d473996cfde4a

記号の有無はハッシュ値に影響を与えません。

'an apple I have' 8b821c9e763bb2fc567d473996cfde4a

トークンの出現回数が同じなら、語順はハッシュ値に影響を与えません。

'I have the apple' 9526cdfcde3ddfad02a0691d564f30ac

トークンが別のものに変わるとハッシュ値も変化します。

'I have apple. I have apple.' 5d5a0ce2d6dc15618d873d5572c4eb5e
'I have a apple. I have the apple.' 5d5a0ce2d6dc15618d873d5572c4eb5e

QUANTが2になるので、1回しか出現しない ‘a’ ‘the’ の有無はハッシュ値に影響を与えません。

'I have an apple. I have an apple. I have the apple.' d95062c38e38e90b1c34b009bf434cda
'I have the apple. I have the apple. I have an apple.' d95062c38e38e90b1c34b009bf434cda

QUANTが3になるので、2回しか出現しない ‘a’ ‘the’ の有無はハッシュ値に影響を与えません。

[Solr]同じ内容のドキュメントの重複を防ぐ(De-Duplication)

はじめに

Solrでは基本的にIDフィールドの値でドキュメントを区別しているため、IDが異なれば同じ内容のドキュメントでも別々にインデックスされます。同じ内容のドキュメントの重複を防ぎたい場合はDe-Duplicationの機能を利用します。

De-Duplication の設定

De-Duplication を利用するためには、updateRequestProcessorChain に SignatureUpdateProcessor を組み込みます。

ここでは例として大阪の施設情報を利用します。以下のような文書構造になっています。

[
  {
    "id": "官公庁!1",
    "type": "官公庁",
    "area": "住之江区",
    "name": "軽自動車検査協会大阪主管事務所",
    "address": "住之江区南港東3-4-62"
  }
]

solrconfig.xml に以下を追加します。

   <updateRequestProcessorChain name="dedupe">
     <processor class="solr.processor.SignatureUpdateProcessorFactory">
       <bool name="enabled">true</bool>
       <str name="signatureField">signature</str>
       <bool name="overwriteDupes">true</bool>
       <str name="fields">type,name</str>
       <str name="signatureClass">solr.processor.Lookup3Signature</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>

  <requestHandler name="/update" class="solr.UpdateRequestHandler" >
    <lst name="defaults">
      <str name="update.chain">dedupe</str>
    </lst>
  </requestHandler>

SignatureUpdateProcessor は、指定されたフィールドのハッシュ値を計算して一致すれば同一ドキュメントとみなすという動きになります。以下の3種類から選んで signatureClass プロパティで指定します。

MD5Signature
- 128ビットのハッシュ
Lookup3Signature
- 64ビットのハッシュ。MD5Signatureよりも高速
TextProfileSignature
- 多少の曖昧さを許す

fieldsプロパティで、どのフィールドが同じなら同一のドキュメントとみなすかを指定します。
上の例では type と name が同一なら同じドキュメントとしました。

signatureField はハッシュ値を格納するフィールドを指定するものです。

overwriteDupes をtrue に設定すると、ドキュメントが同一と判定された場合に新しい方で古い方を上書きします。

実行例

上記の設定をした状態で以下のドキュメントをインデックスします。

[
  {
    "id": "官公庁!1",
    "type": "官公庁",
    "area": "住之江区",
    "name": "軽自動車検査協会大阪主管事務所",
    "address": "住之江区南港東3-4-62"
  }
]

検索結果は以下の通りです。signature フィールドが自動的に付与されています。

{
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"官公庁!1",
        "type":"官公庁",
        "area":"住之江区",
        "name":"軽自動車検査協会大阪主管事務所",
        "address":"住之江区南港東3-4-62",
        "signature":"e3e630e5c046e6d3",
        "_version_":1701175515816132608}]
  }}

次に以下のドキュメントをインデックスします。名前と種別は同じで住所が変更になったという設定です。

[
  {
    "id": "官公庁!2",
    "type": "官公庁",
    "area": "港区",
    "name": "軽自動車検査協会大阪主管事務所",
    "address": "港区築港4-10-3"
  }
]

2番目のドキュメントをインデックスした後の検索結果は以下の通りです。

{
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"官公庁!2",
        "type":"官公庁",
        "area":"港区",
        "name":"軽自動車検査協会大阪主管事務所",
        "address":"港区築港4-10-3",
        "signature":"e3e630e5c046e6d3",
        "_version_":1701175675284619264}]
  }}

期待通り上書きされています。idも新しいものになっています。

ちなみに、overwriteDupes = false で設定した場合には以下のようになりました。

{
  "response":{"numFound":2,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"官公庁!1",
        "type":"官公庁",
        "area":"住之江区",
        "name":"軽自動車検査協会大阪主管事務所",
        "address":"住之江区南港東3-4-62",
        "signature":"e3e630e5c046e6d3",
        "_version_":1701175515816132608},
      {
        "id":"官公庁!2",
        "type":"官公庁",
        "area":"港区",
        "name":"軽自動車検査協会大阪主管事務所",
        "address":"港区築港4-10-3",
        "signature":"e3e630e5c046e6d3",
        "_version_":1701175519813304320}]
  }}

「上書きしない」というのは古い方のドキュメントがそのまま残るのかと思っていましたが、同じシグネチャのドキュメントが重複してインデックスされるということでした。