Lucene -- an open source text search engine API (lecture notes)

Posted: 2007-06-22
   
  /**
* These are lecture notes on Lucene in txt format. If you need a PDF version,
* please contact me (pengjy@263.net).
* Author: pengjy
* Date: 2002-04
* keywords: lucene, api, token, index, chinese, unicode
*/

................page 1 ................

Lucene

an open source text search engine API
high-performance,
full-featured, pure Java

pengjy@263.net

................page 2 ................
Agenda

Overview
APIs
How Does a Search Engine Work
Features
For Chinese characters

................page 3 ................
Overview

An Apache Jakarta Project
High-performance, full-featured
Open source text search engine APIs
Easy to use, fast to build your own search engine

................page 4 ................
Overview

Version 1.2 rc4
Applications using Lucene
2a.WebSearch
Jive Forums
RockyNewsgroup.org

................page 5 ................
APIs

org.apache.lucene.analysis
defines an abstract Analyzer API for converting
text from a java.io.Reader into a TokenStream,
an enumeration of Tokens. A TokenStream is composed
by applying TokenFilters to the output of a Tokenizer.
A few simple implementations are provided, including
StopAnalyzer and the grammar-based StandardAnalyzer
(built with JavaCC).
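
The composition described above (a Tokenizer whose output passes through
TokenFilters) can be sketched in plain Java. This is not Lucene's code: the
class name TokenPipelineSketch, its method names, and the three-word stop
list are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; all names here are hypothetical, not Lucene's API.
class TokenPipelineSketch {
    // Hypothetical stop list used by the stop-word filter stage.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "and", "the"));

    // "Tokenizer" stage: split the characters of a Reader into runs of letters and digits.
    static List<String> tokenize(Reader reader) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isLetterOrDigit(c)) {
                current.append((char) c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // "TokenFilter" stage 1: normalize case.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // "TokenFilter" stage 2: drop stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // "Analyzer": the tokenizer output passed through both filters.
    static List<String> analyze(String text) {
        try {
            return removeStopWords(lowerCase(tokenize(new StringReader(text))));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling TokenPipelineSketch.analyze on a sentence returns its lower-cased
words with the stop words removed.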

................page 6 ~9................
APIs

org.apache.lucene.document
provides a simple Document class. A document is
simply a set of named Fields, whose values may be
strings or instances of java.io.Reader.

org.apache.lucene.index
provides two primary classes: IndexWriter, which
creates and adds documents to indices; and IndexReader,
which accesses the data in the index.

org.apache.lucene.queryParser
uses JavaCC to implement a QueryParser

org.apache.lucene.search
provides data structures to represent queries
(TermQuery for individual words, PhraseQuery for phrases,
and BooleanQuery for boolean combinations of queries) and
the abstract Searcher which turns queries into Hits.
IndexSearcher implements search over a single IndexReader.

org.apache.lucene.store
defines an abstract class for storing persistent
data, the Directory, a collection of named files written
by an OutputStream and read by an InputStream. Two
implementations are provided: FSDirectory, which uses
a file system directory to store files, and RAMDirectory,
which implements files as memory-resident data structures.

org.apache.lucene.util
contains a few handy data structures, e.g.,
BitVector and PriorityQueue.

................page 10 ................
How Does a Search Engine Work

Create indices

input --> analyzer --> filters --> tokens --> indices
             ^
             |
          tokenize
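
As a rough illustration of the flow above, the sketch below turns input text
into tokens and records, for each token, which documents contain it (an
inverted index). It is a minimal in-memory model under assumed names
(InvertedIndexSketch, addDocument, search), not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; names are hypothetical, not Lucene's API.
class InvertedIndexSketch {
    // term -> ascending list of the document numbers that contain it
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Tokenizes the text, adds one posting per distinct term, returns the document number.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a separator
            }
            List<Integer> docs = postings.computeIfAbsent(token, k -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId); // record each document once per term
            }
        }
        return docId;
    }

    // Returns a copy of the document numbers containing the term (empty if none).
    List<Integer> search(String term) {
        List<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs);
    }
}
```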

................page 11 ~ 14 ................
How Does a Search Engine Work

Store Indices
Rather than maintaining a single index, Lucene builds
multiple index segments. For each new document indexed,
Lucene creates a new index segment.
It then merges small segments with larger ones -- this
keeps the total number of segments small so searches remain
fast.

To prevent conflicts (or locking overhead) between
index readers and writers, Lucene never modifies segments
in place; it only creates new ones. When merging segments,
Lucene writes a new segment and deletes the old ones --
after any active readers have closed them.
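
The "never modify in place" idea can be sketched as follows: a merge reads
two segments and writes a third, leaving the inputs untouched so that open
readers keep working. The Segment class and the merge and singleDoc helpers
are hypothetical, in-memory stand-ins for Lucene's on-disk segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch only; names are hypothetical, not Lucene's API.
class SegmentMergeSketch {
    // A segment that is treated as read-only once built: term -> sorted document numbers.
    static final class Segment {
        final SortedMap<String, List<Integer>> postings;

        Segment(SortedMap<String, List<Integer>> postings) {
            this.postings = Collections.unmodifiableSortedMap(postings);
        }
    }

    // Builds the small segment created for one newly indexed document.
    static Segment singleDoc(int docId, String... terms) {
        SortedMap<String, List<Integer>> postings = new TreeMap<>();
        for (String term : terms) {
            List<Integer> docs = new ArrayList<>();
            docs.add(docId);
            postings.put(term, docs);
        }
        return new Segment(postings);
    }

    // Writes a brand-new segment; the posting lists of a and b are copied, never changed.
    static Segment merge(Segment a, Segment b) {
        SortedMap<String, List<Integer>> merged = new TreeMap<>();
        for (Segment segment : Arrays.asList(a, b)) {
            for (Map.Entry<String, List<Integer>> entry : segment.postings.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                      .addAll(entry.getValue());
            }
        }
        for (List<Integer> docs : merged.values()) {
            Collections.sort(docs);
        }
        return new Segment(merged);
    }
}
```

After a merge, a writer would publish the new segment and discard the old
ones once no reader still uses them; that bookkeeping is left out here.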

A Lucene index segment consists of several files:
  A dictionary index containing one entry for each 100 entries
  in the dictionary
  A dictionary containing one entry for each unique word
  A postings file containing an entry for each posting

Since Lucene never updates segments in place, they
can be stored in flat files instead of complicated B-trees.
For quick retrieval, the dictionary index contains offsets
into the dictionary file, and the dictionary holds offsets
into the postings file.
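
The two-level lookup can be sketched like this: search the small dictionary
index first, then scan at most 100 dictionary entries to find the offset
into the postings file. Arrays stand in for the flat files, and the names
(DictionaryIndexSketch, lookup, INTERVAL) are assumptions for illustration,
not Lucene's file format.

```java
import java.util.Arrays;

// Illustrative sketch only; arrays stand in for files, names are hypothetical.
class DictionaryIndexSketch {
    static final int INTERVAL = 100; // one index entry per 100 dictionary entries

    final String[] dictionary;     // sorted unique terms
    final long[] postingsOffsets;  // postingsOffsets[i] belongs to dictionary[i]
    final String[] indexTerms;     // every INTERVAL-th term of the dictionary

    DictionaryIndexSketch(String[] sortedTerms, long[] postingsOffsets) {
        this.dictionary = sortedTerms;
        this.postingsOffsets = postingsOffsets;
        int entries = (sortedTerms.length + INTERVAL - 1) / INTERVAL;
        this.indexTerms = new String[entries];
        for (int i = 0; i < entries; i++) {
            indexTerms[i] = sortedTerms[i * INTERVAL];
        }
    }

    // Returns the postings offset for the term, or -1 if the term is not in the dictionary.
    long lookup(String term) {
        int pos = Arrays.binarySearch(indexTerms, term);
        // On a miss, binarySearch returns -(insertionPoint) - 1; we want the entry before it.
        int block = pos >= 0 ? pos : -pos - 2;
        if (block < 0) {
            return -1; // term sorts before the first dictionary entry
        }
        int start = block * INTERVAL;
        int end = Math.min(start + INTERVAL, dictionary.length);
        for (int i = start; i < end; i++) {
            int cmp = dictionary[i].compareTo(term);
            if (cmp == 0) {
                return postingsOffsets[i];
            }
            if (cmp > 0) {
                break; // passed the place where the term would be
            }
        }
        return -1;
    }
}
```

Only the small index needs to stay in memory; each lookup then touches one
block of the dictionary instead of the whole file.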

Lucene also implements a variety of tricks to compress
the dictionary and posting files -- thereby reducing disk
I/O -- without incurring substantial CPU overhead.
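
The slides do not say which tricks are meant. One technique widely used for
posting lists in general is to store the gaps between sorted document
numbers and write each gap in as few bytes as possible. The sketch below
shows that general idea under assumed names (PostingCompressionSketch,
encode, decode); it is not a description of Lucene's actual file format.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: gap encoding plus variable-length bytes.
class PostingCompressionSketch {
    // Encodes ascending, non-negative document numbers as variable-length gaps.
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int gap = docId - previous;
            previous = docId;
            // Emit 7 bits per byte, low bits first; the high bit means "more bytes follow".
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    // Reverses encode(): rebuilds the document numbers by summing the gaps.
    static List<Integer> decode(byte[] bytes) {
        List<Integer> docIds = new ArrayList<>();
        int previous = 0;
        int i = 0;
        while (i < bytes.length) {
            int gap = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[i++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += gap;
            docIds.add(previous);
        }
        return docIds;
    }
}
```

Small gaps fit in a single byte, so dense posting lists shrink well, and
decoding costs only a few shifts and additions per entry.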

................page 15 ~ 22 ................
Features

Incremental indexing
Incremental indexing allows easy adding of documents to
an existing index. Lucene supports both incremental and batch
indexing.

Data sources
Lucene allows developers to deliver the document to the
indexer through a String or an InputStream, permitting the
data source to be abstracted from the data. However, with
this approach, the developer must supply the appropriate
readers for the data.

Indexing control
Some search engines can automatically crawl through a
directory tree or a Website to find documents to index.
Since Lucene operates primarily in incremental mode, it lets
the application find and retrieve documents.

File formats
Lucene supports a filter mechanism, which offers a simple
alternative to indexing word processing documents, SGML
documents, and other file formats.

Content tagging
Lucene supports content tagging by treating documents
as collections of fields, and supports queries that
specify which field(s) to search. This permits semantically
richer queries like "author contains 'Hamilton' AND body
contains 'Constitution'".
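
A fielded query of that kind can be modelled with the sketch below. It scans
documents linearly instead of using an index, and the class name, the two
fixed fields, and the searchAnd method are hypothetical; it only shows what
"field contains word AND field contains word" means, not how Lucene
evaluates it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: a linear scan over fielded documents, not Lucene's index.
class FieldedSearchSketch {
    private final List<Map<String, String>> documents = new ArrayList<>();

    // Stores a document made of two named fields and returns its number.
    int add(String author, String body) {
        Map<String, String> fields = new HashMap<>();
        fields.put("author", author);
        fields.put("body", body);
        documents.add(fields);
        return documents.size() - 1;
    }

    // True if the named field of the document contains the word (case-insensitive).
    private static boolean contains(Map<String, String> doc, String field, String word) {
        String value = doc.get(field);
        if (value == null) {
            return false;
        }
        String wanted = word.toLowerCase(Locale.ROOT);
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    // Evaluates "field1 contains word1 AND field2 contains word2" over all documents.
    List<Integer> searchAnd(String field1, String word1, String field2, String word2) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < documents.size(); i++) {
            Map<String, String> doc = documents.get(i);
            if (contains(doc, field1, word1) && contains(doc, field2, word2)) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```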

Stop-word processing
Search engines usually do not index certain common words,
called stop words, such as "a", "and", and "the". Lucene handles
stop words with the more general Analyzer mechanism, and provides
the StopAnalyzer class, which eliminates stop words from the
input stream.

Query features
Lucene supports a wide range of query features, including
all of those listed below:
  support Boolean queries.
  return a "relevance" score with each hit.
  handle adjacency or proximity queries -- "search followed
  by engine" or "Knicks near Celtics".
  search on single keywords.
  search multiple indexes at once and merge the results to
  give a meaningful relevance score.

However, Lucene does not support the valuable "Soundex",
or "sounds like," query.

Concurrency
Lucene allows users to search an index transactionally,
even if another user is simultaneously updating the index.

Non-English support
As Lucene preprocesses the input stream through the
Analyzer class provided by the developer, it is possible to
perform language-specific filtering.

................page 23 ................
For Chinese characters

JavaCC -- the Java Compiler Compiler.

build complex compilers for languages such as
Java or C++.
write tools that parse Java source code and perform
automatic analysis or transformation tasks.
EBNF (Extended Backus-Naur-Form)

................page 24 ................
For Chinese characters

org.apache.lucene.analysis.standard.StandardTokenizer.jj

TOKEN : { // token patterns
  <ALPHANUM: (<LETTER>|<DIGIT>)+ >
| <EMAIL: <ALPHANUM> "@" <ALPHANUM>
  ("." <ALPHANUM>)+ > // email address

}

................page 25 ................
For Chinese characters

Add Unicode CJK to StandardTokenizer.jj
< #UNICODECJK:
[
"\u4e00"-"\u9faf", //CJK Unified Ideographs
"\u3400"-"\u4dbf", //CJK Unified Ideographs Extension A
"\u3000"-"\u303f", //CJK Symbols and Punctuation
"\u2e80"-"\u2eff", //CJK Radicals Supplement
"\u3200"-"\u32ff", //Enclosed CJK Letters and Months
"\ufe30"-"\ufe4f", //CJK Compatibility Forms
"\u3300"-"\u33ff", //CJK Compatibility
"\uf900"-"\ufaff" //CJK Compatibility Ideographs
]>
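
To check which characters such a definition would accept, the same ranges
can be tested in plain Java. The class CjkRangesSketch and its isCjk method
are hypothetical helpers for experimenting; they are not part of Lucene or
of the JavaCC grammar.

```java
// Illustrative sketch only: mirrors the ranges listed in the UNICODECJK definition.
class CjkRangesSketch {
    // Inclusive {start, end} code unit pairs, in the same order as the grammar.
    static final int[][] RANGES = {
        {0x4E00, 0x9FAF}, // CJK Unified Ideographs
        {0x3400, 0x4DBF}, // CJK Unified Ideographs Extension A
        {0x3000, 0x303F}, // CJK Symbols and Punctuation
        {0x2E80, 0x2EFF}, // CJK Radicals Supplement
        {0x3200, 0x32FF}, // Enclosed CJK Letters and Months
        {0xFE30, 0xFE4F}, // CJK Compatibility Forms
        {0x3300, 0x33FF}, // CJK Compatibility
        {0xF900, 0xFAFF}  // CJK Compatibility Ideographs
    };

    // True if the character falls inside any of the listed ranges.
    static boolean isCjk(char c) {
        for (int[] range : RANGES) {
            if (c >= range[0] && c <= range[1]) {
                return true;
            }
        }
        return false;
    }
}
```

Note that the list includes the punctuation block 3000-303F, so punctuation
marks in that block would be accepted along with ideographs.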

................page 26 ................
For Chinese characters

Add Unicode CJK
Build Lucene (using the Lucene 1.2 source and Ant 1.4)
Test: Windows 2000 Server + WebLogic 6.1 SP2 +
MS SQL Server 2000 + Jive 2.2.3 + Lucene


................page 27 ................

Thank you!

My mail:pengjy@263.net

................The end ................

Reposted from: http://www.ltesting.net