New Standards and Open Access Can Help Natural Language Processing

A graphic representation of incorporating data from many sources

Clinical notes in medical records are rich sources of data about human health. But tapping them for medical research can be challenging because these data come from various sources—and they all look different.

"There's no standardization in how data is organized and classified across medical records systems," says Sunyang Fu, Ph.D., a Mayo Clinic biomedical informatics researcher.

Even the language people use to talk about health can introduce discrepancies in how data are recorded. "If a patient had a recent fall, they might say they 'went down,' 'flipped backwards,' 'tripped on a rug,' or 'hit the back of their head,'" says Dr. Fu. Scientists are learning to make sense of such widely varied data using natural language processing (NLP), a discipline related to artificial intelligence (AI) that teaches computers to understand human language. Scientists design NLP algorithms to transform disparate information into structured data in a standardized format that can be analyzed.
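To make the idea of turning varied wording into structured data concrete, here is a minimal sketch in Python. It is not the method used by Dr. Fu or any Mayo Clinic system. It is a simple rule-based illustration that assumes a hand-written phrase list built from the examples quoted above. The `Finding` record, the `extract_falls` function, the "fall" label, and the sample notes are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical phrase list built only from the examples quoted in the article.
# A real clinical NLP system would need far broader coverage and context handling.
FALL_PATTERNS = [
    r"\bwent down\b",
    r"\bflipped backwards?\b",
    r"\btripped on\b",
    r"\bhit the back of (?:his|her|their|my) head\b",
]
FALL_REGEX = re.compile("|".join(FALL_PATTERNS), flags=re.IGNORECASE)


@dataclass
class Finding:
    """One structured record extracted from free text (hypothetical schema)."""
    concept: str       # standardized label, here always "fall"
    matched_text: str  # the wording actually found in the note
    start: int         # character offset where the match begins
    end: int           # character offset where the match ends


def extract_falls(note: str) -> list[Finding]:
    """Map varied descriptions of a fall onto one standardized concept."""
    return [
        Finding("fall", m.group(0), m.start(), m.end())
        for m in FALL_REGEX.finditer(note)
    ]


# Invented sample notes, for illustration only.
notes = [
    "Patient states she tripped on a rug last week.",
    "He went down in the kitchen and hit the back of his head.",
    "No complaints today.",
]

for note in notes:
    print([f.matched_text for f in extract_falls(note)])
```

Each differently worded note yields records with the same `concept` value, which is what allows them to be counted and compared. A rule set this small would also misfire on sentences such as "her blood pressure went down," which hints at why published NLP studies need to document their rules and validation steps in enough detail for others to reproduce them.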

Studies that use NLP have shown promise for benefiting patients, says Dr. Fu, but there’s a problem. When publishing their NLP research, scientists don’t always share all the "how to" instructions, sometimes because algorithms are protected as intellectual property. This makes it difficult for other scientists to validate or reproduce a study, and reproducibility is one of the hallmarks of good science.

"The absence of ‘how to’ instructions is becoming a troubling trend," observes Dr. Fu.

Dr. Fu is the first author of a recently published review that found a wide range of inconsistent reporting practices in NLP-assisted observational research.

This research was funded by Mayo Clinic's Center for Clinical and Translational Science.

CTSA Program In Action Goals
Goal 4: Innovate Processes to Increase the Quality and Efficiency of Translational Research, Particularly of Multisite Trials
Goal 5: Advance the Use of Cutting-Edge Informatics