Data quality is a growing concern for biological data repositories. The statistics of the larger repositories reveal an astounding pattern of growth: for example, approximately nine million new sequences were submitted to GenBank in 2003. Combined with the growth of previous years, this suggests that the amount of biological information being stored is increasing at a near-exponential rate.
With this enormous amount of data, data quality issues arise continually. Many data repositories have instituted projects to integrate or synchronize their repository with other repositories. Some investigate how to curate their data effectively with minimal curator interaction, while others look to extract even more knowledge from the data already residing in these repositories.
However, at a fundamental level, there remain many biological data quality issues that hamper integration, knowledge discovery, and automated curation. To begin with, the sheer size of these repositories makes integration, knowledge discovery, and curation difficult problems. Many databases consequently suffer from data discrepancies and errors, ranging from typical database problems, such as spelling errors and missing values for important fields in the data model, to problems unique to these repositories, such as data-specific synonymy and polysemy.

Biological data in particular has a number of specialized problems. For example, non-standard nomenclature occurs frequently in phylogenetic and protein databases. There are also inaccuracies that arise when integrating legacy data with new data. Some repositories have existed for decades, with legacy data reflecting the knowledge of the period in which it was submitted. Over the lifetime of a database, knowledge within the field broadens, creating a need for the data to reflect that new knowledge. Curators, especially curators of non-redundant databases as opposed to archival data repositories, therefore have the task of unifying the database.

From this data, most repositories also want to develop and store metadata. For example, most protein repositories classify their data into motifs, families, or superfamilies. Because this is a complex process, a significant part of this curation is usually done manually. All of these problems are compounded by the complexity of the data. Unlike more traditional data sets, even experts cannot review the data quickly to ensure that the content is correct; a complex record typically requires extensive review by a domain expert to ensure quality.
However, due to the size of the data set, this option is not feasible.

Background
Since my third year of undergraduate studies, I have participated in organized research. As an undergraduate, I had the opportunity to intern at AT&T Research on two different projects. The first concerned feature interactions in telephony: I studied the features of the telephony network, such as call waiting, and observed the undesired interactions these features can cause because of improper specification. My work was then to find methods for minimizing these interactions. My second project concerned data quality and data warehousing.
During my time at NJIT, I have participated in three significant research projects. The first, ATreeGrep, investigated finding topologically similar trees within phylogenetic databases; phylogenetic data models the evolutionary relationships among evolutionary units such as species or population groups. The ATreeGrep algorithm, primarily developed by Dr. Jason T.L. Wang, Dr. Dennis Shasha, and Dr. Kaizhong Zhang, exploits the properties of data modeled as unordered trees to find a given substructure within a tree.
Another project I contributed to concerned MPEG-7 and distance learning technologies. MPEG-7 is a standard developed by the Moving Picture Experts Group (MPEG) for annotating multimedia data; the annotation can describe nearly every enumerable aspect of the data. Since the current state of the art in multimedia information retrieval requires some interaction with annotation, MPEG-7 attempts to standardize that annotation. From this research, I developed, with Vincent Oria and Viswanath Neelavalli, a tool that creates MPEG-7 annotations for a distance learning course.
My work on ATreeGrep led to my dissertation research. During this research, I observed numerous discrepancies within the data that would skew the tool's results. This led me to investigate biological databases and data quality, and in particular the problems World Wide Web data repositories face in maintaining their data quality at an optimal level. Biological data quality is a difficult problem with issues unique to the data set: biological data consists of many complex data models that must interact with one another, and these data models consist of heterogeneous data, so the problem presents many research issues. Based on these observations, my dissertation focuses on developing a framework that can be applied to biological databases to help improve the quality of their data. This framework, called BIO-AJAX, works with the biological data to preserve the experimental results while also creating a more consistent environment for the data. The framework has been implemented on an evolutionary database, TreeBASE.

Summary of Research Plan
My research plan for the next few years is to focus on data quality problems within biological data repositories, using my dissertation experience to explore data quality issues in phylogenetics and proteomics. For my dissertation, I applied BIO-AJAX to evolutionary databases; the implementation on TreeBASE is complete, and the results show significant improvement in TreeBASE data retrieval. Having shown how BIO-AJAX can improve information retrieval and knowledge discovery in this data set, my next step will be to apply it to a larger and more complex data set, such as protein data. In the next few years I plan to explore these options as well as other productive research areas related to biological data quality. This research will involve continuing the partnerships I have developed during my graduate studies as well as initiating new ones. It will continue to investigate exploratory data mining as a means of biological data cleaning and to examine new biological data quality issues. I am currently exploring various funding opportunities for this work and will continue to do so as the project progresses.
I am also open to working with others on research projects where I can make a contribution. During my time at NJIT, I have worked on a number of research projects at different times, with topics including data mining, data warehouse management, multimedia databases, and bioinformatics.