Researchers are turning to text mining tools as one way to more efficiently and accurately extract information from large volumes of scientific data.

With the proliferation of scientific data being produced on a daily basis, researchers are often faced with the challenge of sifting through large databases to locate information that could lead to new medical treatment options.

That was the main topic of discussion at a workshop organized by the Ontario Centre for Genomic Computing (OCGC), which brought together bioinformaticists and text-mining experts from Ontario and abroad.

Retrieving made relevant

Through the Centre for Computational Biology (CCB), the OCGC provides bioinformatics expertise and supercomputing services to academic and commercial researchers at The Hospital for Sick Children and across Ontario.

Hagit Shatkay, assistant professor at Queen’s University’s School of Computing, discussed how researchers can utilize information retrieval methods to rapidly and effectively survey literature.

The goal of information retrieval is to “retrieve only the documents that satisfy the needs of the user,” Shatkay said. She added that the classic model of information retrieval is Boolean, a combinational system created by George Boole that combines propositions with the logical operators and, or, if, then, except and not.
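As a rough illustration of the Boolean model, the following Python sketch builds a small inverted index and answers and, or and not queries with set operations. The sample documents and the function names (build_index, boolean_and, boolean_or, boolean_not) are hypothetical, chosen only for this example, and are not drawn from any tool discussed at the workshop.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, a, b):
    """Documents containing both terms."""
    return index.get(a, set()) & index.get(b, set())


def boolean_or(index, a, b):
    """Documents containing at least one of the terms."""
    return index.get(a, set()) | index.get(b, set())


def boolean_not(index, all_ids, a):
    """Documents that do not contain the term."""
    return all_ids - index.get(a, set())


# Hypothetical mini-collection, for illustration only.
docs = {
    1: "gene expression in yeast",
    2: "protein folding in yeast",
    3: "gene therapy for cystic fibrosis",
}
index = build_index(docs)
all_ids = set(docs)

gene_and_yeast = boolean_and(index, "gene", "yeast")
gene_or_yeast = boolean_or(index, "gene", "yeast")
not_yeast = boolean_not(index, all_ids, "yeast")
```

A document either matches a Boolean query or it does not. The model returns an unranked set, which is one reason ranked approaches are often used alongside it.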

Another model is a probabilistic similarity search, which estimates the probability that a document is relevant to the query. For example, if a user searched for coffee, the model would treat the words acidity and coffee as the most likely to occur in relevant documents and ginger and squash as unlikely to occur, said Shatkay.
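One common way to make the probabilistic idea concrete is a query-likelihood ranking, sketched below in Python. This is an assumption about how such a search might be built, not necessarily the model Shatkay described: each document is treated as a simple word-frequency model, and documents are ranked by how probable the query is under that model. The sample texts, the function name query_log_likelihood and the smoothing parameter alpha are all hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, vocab_size, alpha=1.0):
    """Return log P(query | doc) under a unigram model with add-alpha smoothing."""
    counts = Counter(doc.lower().split())
    total = sum(counts.values())
    score = 0.0
    for term in query.lower().split():
        # Smoothing keeps unseen terms from producing a zero probability.
        p = (counts[term] + alpha) / (total + alpha * vocab_size)
        score += math.log(p)
    return score


# Hypothetical documents, for illustration only.
docs = {
    "a": "coffee acidity varies with coffee roast",
    "b": "ginger and squash soup recipe",
}
vocab = {term for text in docs.values() for term in text.lower().split()}

# Rank document ids from most to least likely to have generated the query.
ranked = sorted(
    docs,
    key=lambda d: query_log_likelihood("coffee", docs[d], len(vocab)),
    reverse=True,
)
```

Unlike the Boolean model, this approach gives every document a score, so results can be ordered rather than simply included or excluded.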

Shatkay said there are different challenges associated with searching for biology documents. “In general, they are not written for the public,” said Shatkay. “They are also not written to be understood and are not written by professional writers. Lastly, they are not usually written by people who speak English as their first language.”

Joel Martin, group leader of interactive information at the National Research Council Institute for Information Technology (NRC-IIT), one of 20 institutes and national programs at the NRC in Ottawa, said more than 40,000 articles are published in scientific literature each month. Geneticists can’t possibly read all the articles that pertain to their field.

To address this problem, NRC-IIT, in conjunction with the NRC Institute for Biological Sciences (NRC-IBS), the Samuel Lunenfeld Institute and Blueprint International, is developing a collection of text and language-based processing tools. Currently in its first phase, the LitMiner project, which Martin calls “content management for biology,” aims to integrate several existing text mining tools into one package. LitMiner is supported by one server running a mix of homegrown NRC-IIT and MySQL databases.

Martin said the institute chose an open source database to maintain flexibility as it may want to make all of its software open source in the future.
