Researchers are turning to text mining tools as one way to more efficiently and accurately extract information from large volumes of scientific data.
With the proliferation of scientific data being produced on a daily basis, researchers are often faced with the challenge of sifting through
large databases to locate information that could lead to new medical treatment options.
That was the main topic of discussion at a workshop organized by the Ontario Centre for Genomic Computing (OCGC) on Friday, which brought together bioinformaticists and text-mining experts from Ontario and abroad.
Through the Centre for Computational Biology (CCB), the OCGC provides bioinformatics expertise and supercomputing services to academic and commercial researchers at The Hospital for Sick Children and across Ontario. These services include computing resources, the development of bioinformatics tools and analyses, application development and desktop support.
The one-day workshop, the second organized by the OCGC, was held at The Fields Institute for Research in Mathematical Sciences, which sponsored the event along with the Ontario Research & Development Challenge Fund.
Hagit Shatkay, assistant professor at Queen’s University’s School of Computing, discussed how researchers can use information retrieval methods to survey literature rapidly and effectively, helping them explain and predict connections between genes and diseases, for example.
The goal of information retrieval is to “retrieve only the documents that satisfy the needs of the user,” Shatkay said. She added that the classic model of information retrieval is Boolean, a combinational system created by George Boole that combines propositions with the logical operators and, or, if, then, except and not. Medline, for example, employs the Boolean model to search its 16 million abstracts.
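As a rough illustration of the Boolean model described above, the following minimal Python sketch builds a tiny inverted index and combines terms with and, or and not. The four abstracts, the `build_index` and `docs_with` helpers and the query terms are invented for this example; this is not how Medline itself is implemented.

```python
# Toy corpus: these four "abstracts" are invented for illustration only.
DOCS = {
    1: "gene expression in breast cancer",
    2: "coffee acidity and flavor",
    3: "gene linked to heart disease",
    4: "cancer treatment options",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


INDEX = build_index(DOCS)


def docs_with(term):
    """Return the ids of documents containing the term (empty set if none)."""
    return INDEX.get(term.lower(), set())


# Boolean operators map directly onto set operations.
gene_and_cancer = docs_with("gene") & docs_with("cancer")  # AND: intersection
gene_or_cancer = docs_with("gene") | docs_with("cancer")   # OR: union
gene_not_cancer = docs_with("gene") - docs_with("cancer")  # NOT: difference

print(gene_and_cancer, gene_or_cancer, gene_not_cancer)
```

A Boolean query either matches a document or it does not, so the results are unranked sets.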
Another model is probabilistic similarity search, which estimates the probability that a document is relevant to the query. For example, if a user searched for coffee, the model would treat the words acidity and coffee as the most likely to occur in relevant documents and ginger and squash as unlikely to occur, said Shatkay.
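The coffee example can be sketched in a few lines of Python. The article does not give the exact model Shatkay presented, so this is only an assumption-laden toy: it estimates, from an invented four-document corpus, how often a word appears in documents that contain the query term. The `cooccurrence_probability` function and the sample documents are hypothetical.

```python
# Toy corpus invented for illustration; not taken from any real collection.
DOCS = [
    "coffee acidity varies by roast",
    "coffee acidity and aroma",
    "ginger squash soup recipe",
    "coffee brewing temperature",
]


def cooccurrence_probability(docs, query, term):
    """Estimate P(term appears in a document | document contains query).

    This is a simple relative-frequency estimate, used here only as a
    stand-in for a probabilistic retrieval model.
    """
    matching = [
        set(doc.lower().split())
        for doc in docs
        if query in doc.lower().split()
    ]
    if not matching:
        return 0.0
    return sum(term in words for words in matching) / len(matching)


for word in ("coffee", "acidity", "ginger", "squash"):
    print(word, cooccurrence_probability(DOCS, "coffee", word))
```

On this toy corpus, acidity scores well above ginger and squash for the query coffee, which mirrors the ordering in the example.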
Shatkay said that while information retrieval models have been around for 50 years, biology documents pose different challenges.
“In general, they are not written for the public,” said Shatkay. “They are also not written to be understood and are not written by professional writers. Lastly, they are not usually written by people who speak English as their first language.”
Joel Martin, group leader of interactive information at the National Research Council Institute for Information Technology (NRC-IIT), one of 20 institutes and national programs at the NRC in Ottawa, said more than 40,000 articles are published in the scientific literature each month. For this reason, Martin said, geneticists can’t possibly read all the articles that pertain to their field.
“The geneticists are overwhelmed with the number of papers they have to read,” said Martin. “It’s easy to miss something.”
To address this problem, NRC-IIT, in conjunction with the NRC Institute for Biological Sciences (NRC-IBS), the Samuel Lunenfeld Institute and Blueprint International, is developing a collection of text- and language-based processing tools. Currently in its first phase, the LitMiner project, which Martin calls “content management for biology,” aims to integrate several existing text mining tools into one package. The project started two years ago.
There are currently between 10 and 20 geneticists using LitMiner, with half using it on a regular basis. “It’s important to get started with a small number of users so we can get feedback,” said Martin, adding that NRC-IIT is working towards getting 11,000 users.
“The number of users will depend on partnerships with other organizations.”
LitMiner is supported by one server running a mix of “home-grown” NRC-IIT and MySQL databases. Martin said the institute chose an open source database to maintain flexibility, as it may want to make all of its software open source in the future.
“If open source is the way to get as many users as possible, then that’s what we’ll do.”
Because of the low number of users, one of LitMiner’s goals is to make searches faster than much larger scientific search engines such as PubMed, which supports millions of users. If a user wanted to find a three-word phrase that is followed by an open parenthesis and a closed parenthesis, for example, it would take several hours using PubMed versus 400 milliseconds using LitMiner.
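To make the kind of query in that example concrete, here is a minimal Python sketch that finds a three-word phrase immediately followed by a parenthesized term, a pattern that often marks an abbreviation. The article does not describe how LitMiner runs such searches, so the regular expression, the `find_phrase_with_parenthetical` function and the sample sentence are all assumptions made for illustration.

```python
import re

# Three whitespace-separated words, then "(", any text without nested
# parentheses, then ")". Hypothetical pattern, not LitMiner's own.
PATTERN = re.compile(r"\b(\w+\s+\w+\s+\w+)\s*\(([^()]*)\)")


def find_phrase_with_parenthetical(text):
    """Return (three-word phrase, parenthesized text) pairs found in text."""
    return PATTERN.findall(text)


# Sample sentence invented for this example.
sample = "Patients with tumor necrosis factor (TNF) mutations were studied."
matches = find_phrase_with_parenthetical(sample)
print(matches)
```

A plain regular-expression scan like this reads every document in full, so a search engine serving a large collection would normally answer such queries from a prebuilt index instead.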
LitMiner has five tabs that users can switch between to perform searches, including its own engine.