A month after the White House launched an effort that brought together technologists and artificial intelligence experts to scour the world’s repository of medical literature for insights on COVID-19, researchers last week reported that journals and publications designed to be read by people are proving difficult for computers to sift through.
Most scientific papers published and distributed using the PDF format are “not amenable to text processing,” the group of 24 researchers wrote in a paper published last week. While “the file format is designed to share electronic documents faithfully for reading and printing,” it poses challenges for “automated analysis of document content.”
The White House Office of Science and Technology Policy on March 16 announced a collaborative venture that included the National Library of Medicine, which is part of the National Institutes of Health; Microsoft; the Allen Institute for AI; Georgetown University’s Center for Security and Emerging Technology; the Chan Zuckerberg Initiative (named for Mark Zuckerberg, Facebook's founder, and his wife, Priscilla Chan); and Kaggle, which is a unit of Google.
The goal was to assemble a dataset of tens of thousands of scientific papers and literature on the coronavirus that would be examined using machine learning and text processing programs to find patterns. That would help researchers rapidly answer questions raised by the World Health Organization and U.S. agencies about the pandemic.
Scientists all over the world have for years been studying and publishing findings on various coronaviruses, including those that cause SARS, MERS, and the latest, COVID-19. Applying artificial intelligence tools to look for commonalities and differences among the thousands of such published articles could help the scientists spot things they may have missed.
Tens of thousands of papers
The project began with about 29,000 papers and journal articles from researchers around the world and soon grew to more than 50,000 papers in the subsequent few weeks, the researchers from all of the groups involved in the effort wrote. The database is known as CORD-19.
“There is however, significant work that remains” on what methods work best to search and analyze the text, how to involve biomedical experts in the process, and how to take useful results from the analysis and convert them into COVID-19 treatments and management of the pandemic, the researchers said.
“If you see a PDF document, it’s great for reading and human consumption but for a machine to understand the content, you have to extract from the different portions of a PDF and tell the machine here’s column one and here’s column two and here’s where a figure is with respect to the text,” Lucy Wang, one of the lead authors of the paper, said in an interview. “Turns out it is not easy to do that.”
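The column problem Wang describes can be illustrated with a minimal sketch. It assumes that some upstream tool has already pulled text blocks and their page coordinates out of a PDF; the TextBlock type, the coordinates, and the reading_order function are hypothetical names invented for this illustration and are not taken from the CORD-19 pipeline. The sketch also assumes a clean two-column page, which real papers with figures, footnotes, and full-width headings often are not.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    x0: float  # left edge of the block, in page units (hypothetical)
    y0: float  # top edge of the block; smaller means higher on the page
    text: str


def reading_order(blocks, page_width):
    """Return block texts in two-column reading order.

    Assumes exactly two columns split at the page midpoint.
    """
    midpoint = page_width / 2

    def sort_key(block):
        # Column 0 is the left column, column 1 the right column.
        column = 0 if block.x0 < midpoint else 1
        # Read all of the left column top to bottom, then the right.
        return (column, block.y0)

    return [block.text for block in sorted(blocks, key=sort_key)]


# Blocks listed in a jumbled order, as an extractor might emit them.
blocks = [
    TextBlock(310, 100, "right column, first paragraph"),
    TextBlock(50, 100, "left column, first paragraph"),
    TextBlock(50, 400, "left column, second paragraph"),
    TextBlock(310, 400, "right column, second paragraph"),
]
ordered = reading_order(blocks, page_width=600)
```

Even this toy version has to be told where the columns split. A figure that spans both columns, or a page with three columns, would break the midpoint rule, which hints at why the task is harder than it looks.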
Researchers also are grappling with collecting accurate metadata on publications in a form that computers can ingest, Kyle Lo, another author of the paper, said. Metadata includes year of publication, authors of a paper, and peer reviewers, which often are signals to other scientists on whether a publication is weighty enough for consideration in their own work.
“If a paper was published 20 years ago,” and is broadly about the whole class of coronaviruses, it may not be immediately useful to answer questions on the current SARS-CoV-2, which is the novel coronavirus that is causing COVID-19, Lo said.
Publishers of scientific journals are helping gather and share their publications with the research group, including pre-publication papers, but each publisher may use a different format for its metadata, which makes it difficult for text processing algorithms to sift through, Lo and Wang said.
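The metadata mismatch can be sketched in a few lines, under stated assumptions. The field names below (such as ArticleTitle and pub_year), the semicolon-separated author string, and the normalize function are all hypothetical, made up to show the kind of mapping involved; they do not describe any real publisher's format or the researchers' own code.

```python
# Hypothetical aliases: each common field lists names a publisher might use.
FIELD_ALIASES = {
    "title": ("title", "article_title", "ArticleTitle"),
    "year": ("year", "pub_year", "PublicationYear"),
    "authors": ("authors", "author_list", "Contributors"),
}


def normalize(record):
    """Map one publisher-specific record onto a common schema."""
    out = {}
    for field, aliases in FIELD_ALIASES.items():
        # Take the first alias present in the record, else None.
        out[field] = next((record[a] for a in aliases if a in record), None)

    # Some sources give authors as one string; split it into a list.
    if isinstance(out["authors"], str):
        out["authors"] = [
            name.strip() for name in out["authors"].split(";") if name.strip()
        ]

    # Years may arrive as text; store them as integers when present.
    if out["year"] is not None:
        out["year"] = int(out["year"])
    return out


# Two invented records describing papers in different layouts.
publisher_a = {"ArticleTitle": "Paper A", "PublicationYear": "2020",
               "Contributors": "First Author; Second Author"}
publisher_b = {"title": "Paper B", "pub_year": 2003,
               "author_list": ["Third Author"]}

normalized = [normalize(publisher_a), normalize(publisher_b)]
```

Once records share one shape, a program can compare them directly, for example by sorting on year to separate older coronavirus research from recent work. Fields that a publisher omits come through as None here rather than being guessed.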
Wang and Lo are researchers at the Allen Institute for AI, which was co-founded by Paul Allen, a co-founder of Microsoft.
Speed is the goal
While literature reviews typically take months or longer to find pertinent scientific articles, and then extract, summarize and find useful insights in them, “here we are trying to automate this process to generate these results very quickly,” Wang said. “It remains to be seen how effective these systems will be but early results seem pretty promising.”
Kaggle, which brings together more than 1 million data scientists from around the world, is holding a competition to generate algorithms that would extract information and findings from the articles to answer questions such as the incubation period for COVID-19 observed around the world. The data scientists would then pass those findings to biomedical researchers, who in turn would provide feedback on further questions.
PubMed Central, a free digital repository of scientific literature on biomedical and life sciences, at the National Library of Medicine, is leading the effort to gather all the scientific literature from around the world for the CORD-19 project.
The repository of articles currently is primarily drawn from sources in the United States, the United Kingdom, the European Union, and East Asian countries. Chinese researchers also have published thousands of papers on COVID-19, as China was the initial epicenter of the outbreak.
Still, there aren’t enough Chinese-language papers in the database, especially those that were produced “during the early stages of the epidemic,” the researchers wrote. The database also is likely missing publications by government agencies, they wrote.
The National Library of Medicine is in the process of collecting some of the Chinese publications, Wang and Lo said. The National Library of Medicine did not immediately respond to questions on its efforts.
NIH is working with as many as 50 scientific publishers and journals that have agreed to make their COVID-19 and broader coronavirus articles freely available via PubMed Central in forms that support text mining and machine analysis, the agency said in an email.
The database includes about 2,000 foreign-language articles, in Chinese, German, Spanish, French, Italian, and other languages, NIH said. The bulk of the rest are in English, the agency said.
Some of the early results from analyzing the database cover topics such as virus transmission, incubation, and environmental factors affecting the disease, as well as risk factors for COVID-19, according to the researchers.