April 28, 2020

A month after the White House launched an effort that brought together technologists and artificial intelligence experts to scour the world’s repository of medical literature for insights on COVID-19, researchers last week reported that journals and publications designed to be read by people are proving difficult for computers to sift through.

Most scientific papers published and distributed using the PDF format are “not amenable to text processing,” the group of 24 researchers wrote in a paper published last week. While “the file format is designed to share electronic documents faithfully for reading and printing,” PDFs pose challenges for “automated analysis of document content.”

The White House Office of Science and Technology Policy on March 16 announced a collaborative venture that included the National Library of Medicine, which is part of the National Institutes of Health; Microsoft; the Allen Institute for AI; Georgetown University’s Center for Security and Emerging Technology; the Chan Zuckerberg Initiative (named for Mark Zuckerberg, Facebook's founder, and his wife, Priscilla Chan); and Kaggle, which is a unit of Google.

The goal was to assemble a dataset of tens of thousands of scientific papers and other literature on the coronavirus that would be examined using machine learning and text-processing programs to find patterns. Those patterns would help researchers rapidly answer questions raised by the World Health Organization and U.S. agencies about the pandemic.

Scientists all over the world have been studying coronaviruses and publishing their findings for years, including work on the viruses that cause SARS, MERS and, most recently, COVID-19. Applying artificial intelligence tools to look for commonalities and differences among the thousands of published articles could help scientists spot things they may have missed.