Using Lucene to count results in categories


Problem description

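I am trying to use Lucene Java 2.3.2 to implement search on a catalog of products. Apart from the regular fields for a product, there is a field called "Category". A product can belong to multiple categories. Currently, I use FilteredQuery to run the same search term against each category to get the number of results per category.

This results in 20-30 internal search calls per query just to display the results, which slows the search down considerably. Is there a faster way to achieve the same result with Lucene?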


Reference solutions

Method 1:

Here's what I did, though it's a bit heavy on memory:

What you need is to create in advance a bunch of BitSets, one for each category, containing the doc IDs of all the documents in that category. Then, at search time, you use a HitCollector and check the doc IDs of the hits against the BitSets.

Here's the code to create the bit sets:

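import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Category is the application's own class; getQuery() returns the query
// that matches every document in that category.
public BitSet[] getBitSets(IndexSearcher indexSearcher,
                           Category[] categories) throws IOException {
    BitSet[] bitSets = new BitSet[categories.length];
    for (int i = 0; i < categories.length; i++) {
        Query query = categories[i].getQuery();
        final BitSet bitSet = new BitSet();
        // Record the doc id of every document in this category.
        indexSearcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                bitSet.set(doc);
            }
        });
        bitSets[i] = bitSet;
    }
    return bitSets;
}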

This is just one way to do this. You could probably use TermDocs instead of running a full search if your categories are simple enough, but this should only run once when you load the index anyway.

Now, when it's time to count categories of search results you do this:

public int[] getCategoryCount(IndexSearcher indexSearcher,
                              Query query,
                              final BitSet[] bitSets) throws IOException {
    final int[] count = new int[bitSets.length];
    indexSearcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            // A hit counts towards every category whose BitSet contains it.
            for (int i = 0; i < bitSets.length; i++) {
                if (bitSets[i].get(doc)) count[i]++;
            }
        }
    });
    return count;
}

What you end up with is an array containing the count of every category within the search results. If you also need the search results, you should add a TopDocCollector to your hit collector (yo dawg...). Or, you could just run the search again. 2 searches are better than 30.
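
As a variation on the same idea (not the code from the answer above), the collector could simply record every matching doc id in one BitSet, and the per-category counts could then be computed by intersecting that set with each category's set. The sketch below uses only java.util.BitSet so it runs without Lucene; the class name, method names and doc ids are all made up for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;

public class BitSetCategoryCount {

    // Builds a BitSet with the given doc ids set (stand-in for a per-category set).
    public static BitSet bitSetOf(int... docIds) {
        BitSet bits = new BitSet();
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    // Counts, for each category, how many of the hits fall into it.
    public static int[] countPerCategory(BitSet hits, BitSet[] categories) {
        int[] counts = new int[categories.length];
        for (int i = 0; i < categories.length; i++) {
            BitSet overlap = (BitSet) hits.clone(); // leave the hit set untouched
            overlap.and(categories[i]);             // docs that are hits AND in category i
            counts[i] = overlap.cardinality();
        }
        return counts;
    }

    public static void main(String[] args) {
        BitSet[] categories = {
            bitSetOf(0, 1, 3), // hypothetical "fruit" documents
            bitSetOf(0, 2)     // hypothetical "computer" documents
        };
        BitSet hits = bitSetOf(0, 2, 3); // doc ids a query is assumed to have matched
        System.out.println(Arrays.toString(countPerCategory(hits, categories)));
    }
}
```

Whether this beats checking each hit inside collect() probably depends on how many hits and categories there are, so it is worth measuring both.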

Method 2:

I don't have enough reputation to comment (!), but in Matt Quail's answer (Method 3 below) I'm pretty sure you could replace this:

int numDocs = 0;
td.seek(terms);
while (td.next()) {
    numDocs++;
}

with this:

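int numDocs = terms.docFreq();

and then get rid of the td variable altogether. This should make it even faster, since it reads the count stored with the term instead of walking the posting list. One caveat: as far as I know, docFreq() can still include deleted documents until their segments are merged, so the two versions may disagree on an index with deletions.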

Method 3:

You may want to consider looking through all the documents that match categories using a TermDocs iterator.

This example code goes through each "Category" term, and then counts the number of documents that match that term.

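import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public static void countDocumentsInCategories(IndexReader reader) throws IOException {
    TermEnum terms = null;
    TermDocs td = null;

    try {
        // Position the enumeration at the first term of the "Category" field.
        terms = reader.terms(new Term("Category", ""));
        td = reader.termDocs();
        do {
            Term currentTerm = terms.term();

            // Stop at the end of the enumeration or once past the "Category" field.
            if (currentTerm == null || !currentTerm.field().equals("Category")) {
                break;
            }

            // Walk the posting list of this term and count its documents.
            int numDocs = 0;
            td.seek(terms);
            while (td.next()) {
                numDocs++;
            }

            System.out.println(currentTerm.field() + " : " + currentTerm.text() + " --> " + numDocs);
        } while (terms.next());
    } finally {
        if (td != null) td.close();
        if (terms != null) terms.close();
    }
}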

This code should run reasonably fast even for large indexes.

Here is some code that tests that method:

public static void main(String[] args) throws Exception {
    RAMDirectory store = new RAMDirectory();

    IndexWriter w = new IndexWriter(store, new StandardAnalyzer());
    addDocument(w, 1, "Apple", "fruit", "computer");
    addDocument(w, 2, "Orange", "fruit", "colour");
    addDocument(w, 3, "Dell", "computer");
    addDocument(w, 4, "Cumquat", "fruit");
    w.close();

    IndexReader r = IndexReader.open(store);
    countDocumentsInCategories(r);
    r.close();
}

private static void addDocument(IndexWriter w, int id, String name, String... categories) throws IOException {
    Document d = new Document();
    d.add(new Field("ID", String.valueOf(id), Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("Name", name, Field.Store.NO, Field.Index.UN_TOKENIZED));

    for (String category : categories) {
        d.add(new Field("Category", category, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }

    w.addDocument(d);
}
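
To see what counts to expect from the four sample documents, here is a small plain-Java model of the same bookkeeping, with no Lucene involved. The class and method names are invented for this sketch, and it assumes no document lists the same category twice. A TreeMap stands in for the sorted term dictionary.

```java
import java.util.Map;
import java.util.TreeMap;

public class CategoryCountModel {

    // Bumps the document count of every category the document belongs to.
    public static void addDocument(Map<String, Integer> docFreq, String... categories) {
        for (String category : categories) {
            docFreq.merge(category, 1, Integer::sum);
        }
    }

    // Rebuilds the per-category counts for the same four sample documents.
    public static Map<String, Integer> sampleCounts() {
        // TreeMap keeps keys sorted, much like terms are enumerated in order.
        Map<String, Integer> docFreq = new TreeMap<>();
        addDocument(docFreq, "fruit", "computer"); // Apple
        addDocument(docFreq, "fruit", "colour");   // Orange
        addDocument(docFreq, "computer");          // Dell
        addDocument(docFreq, "fruit");             // Cumquat
        return docFreq;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : sampleCounts().entrySet()) {
            System.out.println("Category : " + e.getKey() + " --> " + e.getValue());
        }
    }
}
```

The model gives colour 1, computer 2 and fruit 3. Since the categories are indexed untokenized, the Lucene test above should report the same three counts.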

Method 4:

Sachin, I believe you want faceted search. It does not come out of the box with Lucene. I suggest you try Solr, which has faceting as a major and convenient feature.

Method 5:

So let me see if I understand the question correctly: given a query from the user, you want to show how many matches there are for the query in each category. Correct?

Think of it like this: your query is actually originalQuery AND (category1 OR category2 OR ...), except that, as well as an overall score, you want a count for each of the categories. Unfortunately the interface for collecting hits in Lucene is very narrow and only gives you an overall score for a query. But you could implement a custom Scorer/Collector.

Have a look at the source for org.apache.lucene.search.DisjunctionSumScorer. You could copy some of that to write a custom scorer that iterates through category matches while your main search is going on. And you could keep a Map<String,Long> to keep track of matches in each category.
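
The bookkeeping side of that idea can be sketched without touching Lucene's scorer internals. The example below is only an illustration under stated assumptions: the doc-id-to-categories lookup, the class name and the method name are all hypothetical, and a real implementation would receive matching doc ids from the collector during the search rather than from an array.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SinglePassCategoryCount {

    // Tallies categories for each matching doc in a single pass.
    // docCategories is a hypothetical lookup from doc id to its categories.
    public static Map<String, Long> count(int[] matchingDocs,
                                          Map<Integer, List<String>> docCategories) {
        Map<String, Long> counts = new HashMap<>();
        for (int doc : matchingDocs) {
            List<String> categories =
                docCategories.getOrDefault(doc, Collections.<String>emptyList());
            for (String category : categories) {
                counts.merge(category, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> docCategories = new HashMap<>();
        docCategories.put(1, Arrays.asList("fruit", "computer"));
        docCategories.put(2, Arrays.asList("fruit", "colour"));
        docCategories.put(3, Arrays.asList("computer"));

        int[] matchingDocs = {1, 3}; // doc ids the main query is assumed to match
        System.out.println(count(matchingDocs, docCategories));
    }
}
```

Categories with no matching documents simply never appear in the map, so a caller that needs zero counts would have to add them separately.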

(by Sachin, itsadok, Rowan, Matt Quail, Yuval F, Rowan)

References

  1. Using Lucene to count results in categories (CC BY-SA 2.5/3.0/4.0)

#java #filtering #lucene #catalog #search