My Reading...

Wednesday, March 29, 2006

Applying Dempster-shafer theory to Information Retrieving

1) modelling and evaluation
2) modelling and evaluation
3) modelling
4) modelling
5) learning
6) From IR to QA

Friday, March 17, 2006

more about query Expansion





















knowledge base--- knowlege based in this context is about terms would be used by search engine user, and map to terms used by search engine vendor, as well as ontology information.
KB = {[concept terms {[class belief] ]}*
here
concept--terms would be used by search engine users.
terms ---terms used by search engine vendor.
class --- ontology class.

Example,

class University {
Name:
Subjects:
Location:
}

One of possible knowlege Item about this is:
[concept="MIT" terms="Massachusetts Institute of Technology" class=University belief=0.98];

Thursday, March 16, 2006

useful links

[1]next generation of search engine
[2]Topic search Engine

terms decomposition and query formation

Without introducing new word, a query string could be analyzied according pre-defined knowledge.
For example, the orignal query could be 道德经. as we know that 德经 is book too.
so we could expand the query into 道德经 " 德经". By using google, the results are different.

Another example is
the original query=MIT Computer Science and Artifical Intelligence Laboratory.
and
the expanded query= MIT computer science and artificial intelligence laboratory "artificial intelligence laboratory"
returns different set of results though no extra word are introduced.

More importantly, if the term knowledge base could supply information like category or subcategory, such extra information could be incorported into the orignal query and help improve the precision of search late.




Wednesday, March 15, 2006

Query Expansion/post filtering

The need for query expansion or result filter.

The term collection preferred by users of searching engine could be quite different from the term collection pre-defined the documents. For example, a company named "crystal Jade Kitchen" is considered as "crystal Jade" by users of search engine.

One way is to break down the terms used in document into its minimal units, For example, instead of treat "crystal Jade Kitchen" as a term, break it down into 3 terms "crystal", "jade" and "Kitchen". But learn concepts (automatically or manully) and chain the small units of terms into big one, say "crystal jade" and form a concept <"crystal jade", restruant>, here the restruant is the class name for "crystal jade".

The key is to learn the terms used by users and mapped it into the term space in the collection of document.

Let td[i] 1<=i<=M is a set of terms used in the documents, while tq[j] 1<=j<=H. The tq[j]=concept[| m<=i<=M}]

references:

[1]Improving Document Retrieval by Automatic Query Expansion Using Collaborative Learning of Term-Based Concept
[2]
Query Reformulation with Collaborative Concept-Based Expansion

Terminology for search

Term
Term is the basic unit to form a document or a piece of message. Term is releted to domain the application is designed for and languages. Take Chinese for an example, 道德经 is a book, itself is term, while 道德,经 are two seperate terms too. A statement might consist of a collection of words which means nothing or different thing when seperates them into words or characters. For example, an organization's name typically is not a sequence of terms related to the domain. in this case, tagging mechanism is required to improve the search performance.

Vector Space Model (VSM)
Let t[i] (1<=i<=M) and d[j] (1<=j<=N) represent a term and a document in the database, where M is the number of terms and N is the number of documents. In the VSM, a document d(j) is represented as a M dimensional vector
d[j]=(w[1,j], ..., w[M,j]) where w[i,j] is a weight of term t[i] in a document d[j].

A query is represented as
q[k]=(w[1,k],....,w[i,k], ...., w[M,k]), 1<=k<=L, where w[i,k] is a weight of a term t[i] in a query q[k].
These weights are computed by the standard normalized tf*idf weighting scheme as follows:
w[i,j]=tf[i,j]*idf[i]. where tf[i,j] is the weight calculated using the term frequencey f[i,j] and idf[i] is the weight is weight calculated using the inverse of the document frequency (?).

The result of the retrieval is represented as a list of documents ranked according to their similarity to the query. The similiarity sim(d[j], q[k]) between a document d[j] and a query q[k] is measured by the standard cosine of the angle between d[j] and q[k]:
sim(d[j], q[k])=d[j] x q[k]/(||d[j]||*||q[k]||).

A problem of the standard VSM is that a query is often too short to rank documents appropriately, to cope with this problem, query expansion mechanism could be applied.

Tuesday, March 14, 2006

Vertical search Engine

Vertical search Engine

Vertical Search Engine is called local search Engine sometimes. They are designed to search through structured or semi-structured database and return the users search results. Such as
search local pizza hut, restruants.

The main challenge to veritcal search engine is that the query itself could be unstructured. The good thing about local search is that the data is structured or semi-structured, and hance the ontologies built-in could be applied to improve the search performance.

Even though the data is structured, structured data might contains unstructured data. For example, in the company table, the company name typicall is not well structured compared to company code.

The difficulties facing search unstructure data elements.
1) Synonym, a term or phrase has the similar meaning to another term.
2) Tranditional search technology could not capture the relationships between entities.
3) Co-occurence terms. User might simply omit the Co-occurence terms. This issue could be address by query expansion mechanism, by adding the Co-occurence terms collected manully or statistically to orignal query. However the expanded query might return unrelevant results and deteriorate the search precision.

A possible cure to those problems is applying domain dependent conceptural model. Ontologies are example of such conceptual models. In this context, ontology is a collection of concepts (similar with class in the object-oriented world) and their interrelationships which can collectively provide an abstract view of an application domain.

To some extend, the ontology is very similar with database schema which capture mainly parent and child relationship, while ontology is meant to capture any possible types of relationships.

However, the difficulty here is to "understand" , "parse" or 'tag' the input query string with the pre-defined ontology, namely ontology-based information extraction. In the IE(Information extraction) world, this is so-called named entity recognization.

There are serveral way to recognize a name entity.
1) maintain a list of string pair. For example ,
2) recognize a name entity by pattern matching. For example, Dr. SomeOne is a person. the pattern is Dr [.] String.
3) hybrid the first and the second method. In Chinese, the last name of person usually use a very special set of Chinese character, and typically 3 Chinese character for one person's name. By storing this set of characters for the last name, the rest of them could be calculated.

Vertical search usually needs deal with organization name which is not well structured. Especially, the query string is very like will use short form of the organization name. For this case, there is other option but keeping it in a list.

How ontology help vertical Search?

Providing Context to Web Searches:
The Use of Ontologies to Enhance Search Engine's Accuracy


co-occurence word could be added to original query, this is the query expanding process. Ontologies could be used for control the process of query expansion.

On the other hand, from ontology's perspective view, entities captured in the system are linked together. PageRank algorithm applied by google could be used, an iterative algorithm to determines' one entity's "importance" based upon the importance of its related entites.