
Jaffar Muneer Atwan

PhD Abstract

Arabic is a highly inflected language with complex morphology. In addition, a lack of knowledge about the structure of search text in Arabic querying has contributed to the difficulty of Arabic document retrieval. Moreover, Arabic is a polysemous language, meaning that a single word can have several meanings. The conventional Information Retrieval (IR) process consists of four primary phases, namely pre-processing, tokenization, indexing, querying, and retrieving results. Several drawbacks have been recorded in some phases of the current Arabic IR framework. In the pre-processing phase, Arabic IR has no enhanced stopwords list or normalization steps.

Furthermore, stemming algorithms do not preserve the semantics of the stemmed word, which causes ambiguity. In the query processing phase, user queries suffer from several drawbacks, such as misspelled and ambiguous vocabulary. Additionally, a user query often expresses a general concept, while the relevant documents are written using a different vocabulary. Another drawback of the query processing phase is that adopting Query Expansion (QE) approaches may produce worse rankings or irrelevant results. The main objective of this study is to enhance an Arabic retrieval framework by improving the conventional IR processes. Other objectives include introducing a general stopwords list at the pre-processing level and investigating several Arabic stemmers. Moreover, to address the ambiguity problem, Arabic WordNet was utilized at two levels: corpus and query expansion. The Point-wise Mutual Information (PMI) corpus-based measure was used to select the semantically appropriate synonym from WordNet. In addition, to improve the model's performance, Automatic Query Expansion (AQE) and Pseudo Relevance Feedback (PRF) were explored. Finally, adding semantic information to PRF was investigated. A multi-method research approach was used to achieve the objectives. An enhanced Arabic IR framework was built and evaluated using TREC 2001 data. The phases of the research method include theoretical study, design of the proposed techniques, prototyping, and empirical evaluation of the techniques. A combined stopwords list was used to reduce the corpus size. Evaluation results showed that the combined list successfully reduced the corpus size and increased precision. The use of Arabic WordNet to build semantic relationships between query and corpus at two levels, corpus and query, is a new technique. Additionally, the experimental results of the proposed techniques show that using Arabic WordNet at the corpus and query levels with AQE, together with the adoption of PMI in the expansion process, successfully reduced ambiguity by selecting the most appropriate synonym. The enhanced Arabic IR (AIR) framework demonstrated an improvement of 49% in Mean Average Precision and an increase of 7.3% in recall compared to the baseline framework.
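
The abstract does not spell out how PMI is computed or applied, so the following is only a minimal sketch of PMI-based synonym selection under stated assumptions: probabilities are estimated from document-level co-occurrence, the candidate synonyms are taken as given (in the thesis they would come from Arabic WordNet), and the toy English tokens stand in for Arabic terms. The function names `build_counts`, `pmi`, and `best_synonym` are hypothetical and not taken from the thesis.

```python
import math
from collections import Counter
from itertools import combinations


def build_counts(docs):
    """Count, per term and per unordered term pair, how many documents contain them."""
    term_df = Counter()
    pair_df = Counter()
    for tokens in docs:
        vocab = sorted(set(tokens))              # unique terms, sorted so pairs are ordered
        term_df.update(vocab)
        pair_df.update(combinations(vocab, 2))   # every pair (a, b) with a < b
    return term_df, pair_df


def pmi(a, b, term_df, pair_df, n_docs):
    """PMI(a, b) = log2(P(a, b) / (P(a) * P(b))), estimated from document counts."""
    joint = pair_df[tuple(sorted((a, b)))]
    if joint == 0 or term_df[a] == 0 or term_df[b] == 0:
        return float("-inf")                     # the terms never co-occur in the corpus
    return math.log2(joint * n_docs / (term_df[a] * term_df[b]))


def best_synonym(context_term, candidates, term_df, pair_df, n_docs):
    """Pick the candidate synonym with the highest PMI against a context term."""
    return max(candidates, key=lambda c: pmi(context_term, c, term_df, pair_df, n_docs))


# Toy corpus (assumption): "bank" is ambiguous between "lender" and "shore".
docs = [
    ["lender", "loan", "money"],
    ["lender", "loan", "account"],
    ["shore", "river", "water"],
    ["bank", "loan", "money"],
]
term_df, pair_df = build_counts(docs)
print(best_synonym("loan", ["shore", "lender"], term_df, pair_df, len(docs)))  # prints "lender"
```

In this sketch, a query that mentions "loan" steers the expansion of an ambiguous term toward "lender", because that candidate co-occurs with the context term in the corpus while "shore" does not.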



