Semantic-based distribution and similarity for scientific publications.
Semantic similarity and topic distribution for articles annotated with biotea-annotation (https://github.com/ljgarcia/biotea-annotation). The topic distribution is based on the Biolinks reclassification of UMLS semantic groups.
Biolinks relies on annotations that follow the schema proposed by Biotea. It combines semantic annotations with term frequency and inverse document frequency (tf-idf) to compute a similarity score between a pair of scientific articles. The first article is the query article, i.e., an article in which a reader has already expressed interest. The second is the related article, i.e., an article of potential interest to a reader given their interest in the query article. The similarity metric used by Biolinks is PMRA (PubMed Related Articles). The reasons behind the selection of this metric are explained here.
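To make the idea of weighting semantic annotations with tf-idf concrete, here is a minimal Python sketch. It is not the PMRA metric and not the Biolinks implementation: it swaps in plain cosine similarity over tf-idf vectors, and the function names, article ids, and concept identifiers (`C1`, `C2`, ...) are hypothetical placeholders chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a tf-idf vector per article.

    docs maps an article id to the list of concept identifiers found in it,
    one entry per annotation occurrence (hypothetical input shape).
    """
    n = len(docs)
    # Document frequency: in how many articles each concept appears.
    df = Counter()
    for concepts in docs.values():
        df.update(set(concepts))
    vectors = {}
    for doc_id, concepts in docs.items():
        tf = Counter(concepts)
        # A concept present in every article gets weight 0 (log(1) == 0).
        vectors[doc_id] = {c: tf[c] * math.log(n / df[c]) for c in tf}
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Placeholder annotations: "query" and "related" share concept C1,
# while "other" shares nothing with either.
docs = {
    "query": ["C1", "C1", "C2"],
    "related": ["C1", "C3"],
    "other": ["C4"],
}
vectors = tfidf_vectors(docs)
score_related = cosine(vectors["query"], vectors["related"])
score_other = cosine(vectors["query"], vectors["other"])
```

In this toy collection the related article scores above zero because it shares a concept with the query article, while the unrelated one scores exactly zero. PMRA uses a different, probabilistic weighting, so actual Biolinks scores will not match this sketch.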
Biolinks also calculates the UMLS group distribution for the TREC topics. This distribution follows topic modeling principles. However, rather than discovering topics, it uses UMLS semantic groups as predefined topics and finds the distribution across UMLS groups for the articles within a TREC topic. Biolinks works with a reclassification of UMLS semantic groups.
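A distribution over predefined groups can be sketched as a normalized count of annotations per group. The Python snippet below is only an illustration under stated assumptions: the input shape and the function name are hypothetical, and the labels `DISO`, `CHEM`, and `GENE` are standard UMLS semantic group abbreviations used as stand-ins, since the Biolinks reclassification may use different groups.

```python
from collections import Counter


def group_distribution(articles):
    """Return the share of annotations falling in each group.

    articles is a list with one entry per article in a topic; each entry is
    the list of group labels of that article's annotations (assumed shape).
    """
    counts = Counter()
    for groups in articles:
        counts.update(groups)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: count / total for group, count in counts.items()}


# Two placeholder articles belonging to the same topic.
topic_articles = [
    ["DISO", "CHEM", "DISO"],
    ["GENE", "DISO"],
]
distribution = group_distribution(topic_articles)
```

Here three of the five annotations belong to `DISO`, so that group accounts for 60% of the topic and the other two groups for 20% each. Whether Biolinks counts raw annotation occurrences or weights them in some other way is not stated above, so treat the counting rule as an assumption.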
Example using TREC 2005 Genomics Track data
We have created one parser, four visual components and one helper component to make it easier to analyze annotated articles.
If you are interested in semantic annotations, here you can find more information about our semantic annotation project. If you want to learn more about the dataset used, please keep reading.
Here we present a working example showing how Biolinks works. It includes topic distribution, semantic similarity and semantic annotations.
We annotated the title and abstract and, where available, the full text of articles in the TREC 2005 Genomics Track collection. The collection comprises a ten-year subset of MEDLINE. It includes 34,633 unique PubMed identifiers (PMIDs) grouped in 50 topics corresponding to different information needs. For each PMID within a particular topic, domain experts assigned a relevance judgment depending on whether the article was not relevant, partially relevant, or relevant for the topic. We included only those articles that were classified as relevant or partially relevant within the TREC topics and that have at least a title and an abstract, i.e., 4,240 articles. Of those, only 62 have full-text content publicly available via PMC-OA.