Inside a data warehouse of disease

MONTREAL — Turns out there is a little data warehouse in all of us.

There are billions, in fact, spread throughout our bodies in a network more complex than any system administrator could handle. But the files inside these data warehouses contain the life-and-death secrets that will serve as the foundation for the next generation of pharmaceuticals. The repositories in question are proteins, and here, at a brand-new facility just 10 minutes away from the Dorval airport in Montreal, a group of scientists is trying to map them out.

Caprion, a three-year-old biotechnology startup specializing in the sophisticated research called proteomics, opened its doors Thursday to a large audience of Canada’s scientific community and senior executives from its major vendor partners, including Sun Microsystems Inc., Oracle Corp. and CGI Group. The event was intended to showcase the company’s considerable IT resources and outline its proprietary protein-mapping platform, CellCarta.

Lloyd Segal, Caprion’s president, said the recently completed facility will use Sun’s SunFire servers to analyze data and Oracle software to create a data warehouse of protein information. CGI is the system integrator putting the solution together.

While there has already been considerable attention paid to the mapping of the human genome, there is much more work to be done in breaking down genes into their component parts, organelles, and examining the proteins inside them. Scientists estimate that there could be as many as 20 proteins for every gene, and we only understand the functions of about five per cent of them. Caprion’s staff takes tissue samples, isolates proteins and tries to look for patterns and variations that could indicate their purpose. This data can then help Caprion and its pharmaceutical partners to develop new drugs to combat disease.

“These guys are chopping off mountains and examining every grain of sand they come across,” Segal said.

While most biotechnology firms in this area “mash” proteins and manage a low-resolution image of their structure, Segal said the CellCarta platform spins out individual proteins and magnifies them. This will allow the company to examine the less-abundant proteins, he said. “This is like watching a football game from the sidelines as opposed to seeing it from the Goodyear blimp,” he said.

The server and database technology plays a key role in speeding up the work of Caprion’s researchers. Jean-Francois Gorup, regional technical manager for Sun Microsystems of Canada’s Quebec region, said the traditional way of manually feeding protein information into a device called a spectrometer could take about two weeks to get results. By creating an algorithm that automates the process, Caprion will be able to achieve the same results in five minutes. Gorup said the project, which was officially announced in June, required considerable customization of the Sun hardware. That meant taking a crash course of sorts in proteomics.

“You have to learn the jargon,” he said. “We’re not trying to do Caprion’s job, and Caprion’s not trying to do our job, but at some point you have to be able to talk about what you need to do.”

Bill Bergen, president of Oracle Corp. Canada, said there was little customization on the database side. “Just on the application, to extract the data,” he said.

Segal said the size of data involved would equal 135 miles of floppy disks, but Sun and Oracle have set up the facility to grow over time from storing terabytes of data to petabytes. “We don’t care that our competitors are working with other hardware companies,” he said, adding that the current implementation puts Caprion on a par with pharmaceutical firms, not just biotechnology firms. “I’m not required to say anything nice about (Sun Canada president) Everett (Anstey), but since we broke ground the systems, not just the physical building, are in place and running ahead of schedule.”

Caprion has designed its headquarters with expansion in mind. Many of the corridors, most of which still smell of fresh paint, are wide enough to accommodate more equipment. The laboratories where the proteins are extracted and analyzed are hidden from windows to avoid temperature variations, and come equipped with negative pressure doors that force air out when they are opened to keep the rooms dust-free.

The floor in the room where the spectrometers are housed sits on a separate slab of concrete to protect it from vibrations. In the Sun server room, meanwhile, the floor has been raised to allow easy access should any cabling changes be required. The data farm is managed by about 20 Ph.D. and Master’s degree graduates with servers on each desktop.

CGI vice-president Michael Roach said Caprion represents one of the largest biotechnology projects in the world. “It’s a whole new wave of bioinformatics, and you’re seeing that it’s increasingly tied to IT,” he said. “We want to leverage what we’ve learned here and expand it into other engagements.”
