Research Interests (ChengXiang Zhai)

I. Text Information Management

Text information plays a very important role in our lives. Web pages, email messages, scientific literature, and office documents are good examples of text information that we encounter all the time. With the dramatic increase in online information in recent years, management of text information is becoming increasingly important; for example, Web search engines are now being used by all of us routinely to find information on the Web. The huge amount of information presents both challenges and opportunities. The challenge is how to manage large amounts of information effectively and efficiently so that we can easily find useful information. The opportunity is the possibility of exploiting statistical inference to discover knowledge ("hidden patterns"). Correspondingly, I am interested in two broad directions -- intelligent information access (to address the challenge) and text data mining (to exploit the opportunity).

There are two modes of information access -- "pull" and "push", depending on whether the user initiates the process. In the pull mode, a user searches for information by using a search engine (e.g., Google) or browses information items through structures available on the information space (e.g., Yahoo directory). In the push mode, an information management system keeps track of a user's interests and recommends relevant incoming information items to that user. My specific interests in intelligent information access are centered on information retrieval (leading to better search engine technologies such as personalized search), information organization (creating structures to assist a user in browsing), and information filtering (i.e., information recommendation).

In text data mining, I am especially interested in comparative text mining, which is concerned with extracting common and unique themes from a set of comparable text collections. Depending on the collections being compared, comparative text mining covers spatiotemporal text mining, cross-language text mining, novelty detection, and many other interesting text mining problems as special cases. It has many applications, such as opinion summarization, business intelligence, text federation, and customer relationship management.

I believe that natural language processing is crucial to all kinds of text management tasks, and I am especially interested in the development of algorithms that exploit language technologies, such as statistical language models (i.e., probabilistic models of text).
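As a small illustration of what "probabilistic models of text" can mean, here is a minimal sketch of a unigram language model with additive smoothing, used to score a short query against two toy documents. The function names, the toy documents, and the choice of additive smoothing are assumptions made for this example only; they are not a description of any particular system.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Return (counts, total) describing a unigram model of the tokens."""
    counts = Counter(tokens)
    return counts, sum(counts.values())

def log_prob(tokens, counts, total, vocab_size, alpha=1.0):
    """Log-probability of tokens under an additively smoothed unigram model.

    alpha is the pseudo-count added to every vocabulary word (an assumption
    of this sketch; other smoothing methods are equally possible).
    """
    denom = total + alpha * vocab_size
    return sum(math.log((counts[t] + alpha) / denom) for t in tokens)

# Hypothetical toy documents and query.
doc_a = "text retrieval and text mining".split()
doc_b = "protein sequence motif analysis".split()
vocab = set(doc_a) | set(doc_b)
query = "text mining".split()

model_a = train_unigram(doc_a)
model_b = train_unigram(doc_b)

# The document whose model gives the query a higher log-probability
# is treated as the better match.
score_a = log_prob(query, *model_a, vocab_size=len(vocab))
score_b = log_prob(query, *model_b, vocab_size=len(vocab))
print(score_a, score_b)
```

Ranking documents by how likely their language model is to generate the query is one simple way such a model can support retrieval; smoothing keeps unseen query words from driving the probability to zero.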

I have a broad interest in all kinds of applications of text information management, such as Web search, digital libraries, and email management.

II. Bioinformatics

Bioinformatics is an emerging interdisciplinary research area where computational methods are applied to exploit various kinds of biological data and information to help biology research. The field of molecular biology has been generating huge amounts of biological data and information very rapidly, including DNA sequences, gene expression data, and protein sequences. A major challenge is how to manage and make effective use of such data as well as the huge amount of biology literature.

I am becoming more and more interested in bioinformatics for several reasons. First, it is an excellent application domain for text information management techniques. Second, the methods used in bioinformatics (e.g., Hidden Markov Models) tend to be similar to those used for processing text. Third, bioinformatics is growing fast and presents many interesting and challenging new computational problems. Finally, and most importantly, I like the fact that research in bioinformatics brings computer science closer to scientific discovery.

My current interests in bioinformatics include (1) Biology literature analysis: the goal is to use natural language processing and text mining techniques to extract useful information from the literature that can benefit a biologist, either directly or indirectly through a combination of literature analysis and other biological data analysis. (2) Gene regulatory pattern analysis: the goal is to use machine learning and data mining techniques to find regulatory motifs and other transcription factor (TF) binding site characteristics in the upstream subsequences of co-regulated genes, in order to understand gene function and regulation. (3) Massive protein motif analysis: the goal is to build a dictionary of "elementary" motifs with structural and functional information through statistical analysis of protein sequences and Gene Ontology annotations.
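To make the idea of finding shared patterns in upstream sequences concrete, here is a deliberately simplified sketch that lists the fixed-length substrings (k-mers) occurring in every sequence of a set. The sequences, the function name, and the exact-match criterion are all assumptions for illustration; real regulatory motif analysis generally relies on probabilistic models that tolerate variation between sites, which this toy example does not attempt.

```python
from collections import Counter

def kmer_support(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)
    return support

# Hypothetical upstream sequences of co-regulated genes (made up for this sketch).
upstream = ["ACGTGACTTA", "TTGACTCAGG", "CCATGACTGA"]

support = kmer_support(upstream, 5)

# Keep only the k-mers present in every sequence as candidate shared patterns.
shared = sorted(kmer for kmer, n in support.items() if n == len(upstream))
print(shared)
```

Counting per-sequence support rather than raw occurrences keeps a k-mer that repeats many times in a single sequence from looking like a pattern shared across the whole set.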
