InfoVis CyberInfrastructure

Medline Data

Description

PubMed, a service of the National Library of Medicine, includes over 14 million citations for biomedical articles dating back to the 1950s. These citations come from MEDLINE and additional life science journals.

Origins

The MEDLINE baseline database was acquired via a License of NLM® Data.

Data Format

Raw Data:
The data element descriptions and the MEDLINE update chart are available, along with other information, from the Information for Licensees of NLM® Data.

Data Fields:
ORIGINAL_MEDLINE_ID VARCHAR2(250)
PUB_MED_UNIQUE_IDENTIFIER VARCHAR2(255)
DATE_CREATED DATE
DATE_COMPLETED DATE
INT_STANDARD_SERIAL_NUMBER VARCHAR2(2550)
VOLUME VARCHAR2(2550)
ISSUE VARCHAR2(2550)
AFFILIATION VARCHAR2(2000)
DATE_REVISED DATE
PAGE_NUMBER_START NUMBER
PAGE_NUMBER_END NUMBER
PUBLICATION_TYPE VARCHAR2(2000)
LANGUAGE VARCHAR2(2500)
COUNTRY VARCHAR2(2550)
DATA_IS_OK CHAR(1)
JOURNAL_TITLE_ABBREVIATION VARCHAR2(1000)
NLM_UNIQUE_JOURNAL_ID VARCHAR2(2500)
CHEMICAL_LIST VARCHAR2(4000)
CITATION_SUBSET VARCHAR2(4000)
DATE_PUBLISHED DATE
TITLE VARCHAR2(2550)
JOURNAL_NAME VARCHAR2(2550)
DATE_ENTERED TIMESTAMP(6) NOT NULL
COLLECTION VARCHAR2(100)
ABSTRACT CLOB
LAST_NAME VARCHAR2(1000) NOT NULL
MIDDLE_NAME VARCHAR2(1000)
FIRST_NAME VARCHAR2(1000)
MESH_HEADING VARCHAR2(255)

Statistics:
Years covered: 1963-2002, total number of records: 11,693,477 (detailed statistics)

Storage Space Required:
392 files, each about 135 MB (gunzipped). The data occupies about 60 GB on the Oracle server (20 GB for “medline_t” and 40 GB for other tables).
By August 2005, we expect to have about 15,000,000 MEDLINE records, which will need 59 GB for tables and indexes: 18 GB for "medline_table", 24 GB for "document_table", 5 GB for "created_by", and 12 GB for "author_table".
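
The figures above can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers quoted in this section; the conversion of 1 GB = 1000 MB and the variable names are assumptions made for illustration.

```python
# Projected per-table sizes in GB, as quoted in the text above.
projected_gb = {
    "medline_table": 18,
    "document_table": 24,
    "created_by": 5,
    "author_table": 12,
}

# The per-table figures should add up to the quoted total.
total_gb = sum(projected_gb.values())

# Size of the raw files: 392 files of about 135 MB each.
# Assumption: 1 GB = 1000 MB.
raw_gb = 392 * 135 / 1000

# Rough average storage per record for the projected 15,000,000 records.
# Assumption: 1 GB = 10**9 bytes.
bytes_per_record = total_gb * 10**9 / 15_000_000

print(f"projected total: {total_gb} GB")
print(f"raw files: about {raw_gb:.1f} GB")
print(f"about {bytes_per_record:.0f} bytes per record")
```

Under these assumptions, the per-table projections sum to the stated 59 GB, and the raw files come to a little under 53 GB.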

Data Quality

The original data is in XML format and rather clean. There are minor consistency issues in the dataset:

Some records have only “Year” and “Month” in the date fields (Date_Published, DateRevised).
Some records have a PageNumber string that cannot be divided into Starting_Number and Ending_Number (e.g., “737-808-9-811-2”, “798-801; quiz 802-3 798”).
Some records have no value in the DateRevised field.
Some date fields use MonthName while others use MonthNumber.
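
To illustrate the page-number issue, here is a minimal sketch of a splitter that accepts only plain "start" or "start-end" strings and gives up on anything else. The function name `split_pages`, the choice to return `(None, None)` for unparseable strings, and the treatment of a single page as start = end are assumptions for illustration, not a description of the actual parser.

```python
import re
from typing import Optional, Tuple

# Accept only "start" or "start-end" made of plain digits.
# Assumption: anything else is left empty rather than guessed at.
_PAGE_RANGE = re.compile(r"^\s*(\d+)\s*(?:-\s*(\d+))?\s*$")


def split_pages(raw: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (start, end) page numbers, or (None, None) if not parseable."""
    match = _PAGE_RANGE.match(raw)
    if match is None:
        return (None, None)
    start = int(match.group(1))
    # Assumption: a single page number means start and end are the same.
    end = int(match.group(2)) if match.group(2) else start
    # Assumption: an end page smaller than the start page is not trusted.
    if end < start:
        return (None, None)
    return (start, end)


print(split_pages("798-801"))
print(split_pages("737-808-9-811-2"))
print(split_pages("798-801; quiz 802-3 798"))
```

With this approach, the two problematic examples quoted above yield empty start and end values, while a well-formed string such as "798-801" is split as expected.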

Data Cleaning

The dataset XML parser takes care of the inconsistency problems mentioned above before uploading the records into the database.
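
As a rough illustration of how the date inconsistencies could be normalized, the sketch below accepts month names or month numbers, defaults a missing day to the first of the month, and returns `None` when no date is present. The function names and the default-day rule are assumptions; the actual parser may handle these cases differently.

```python
from datetime import date
from typing import Optional

# Map three-letter English month abbreviations to month numbers.
MONTH_NUMBERS = {
    abbr: number
    for number, abbr in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def parse_month(raw: str) -> int:
    """Accept either a month number ("03") or a month name ("Mar", "March")."""
    text = raw.strip().lower()
    if text.isdigit():
        month = int(text)
    else:
        month = MONTH_NUMBERS.get(text[:3], 0)
    if not 1 <= month <= 12:
        raise ValueError(f"unrecognised month: {raw!r}")
    return month


def normalize_date(year: Optional[str], month: Optional[str],
                   day: Optional[str] = None) -> Optional[date]:
    """Build a date from possibly incomplete string parts."""
    # A record without the field (e.g. no DateRevised) maps to None.
    if not year or not month:
        return None
    # Assumption: a missing day defaults to the first of the month.
    day_number = int(day) if day and day.strip() else 1
    return date(int(year), parse_month(month), day_number)


print(normalize_date("1998", "Mar"))
print(normalize_date("1998", "03", "15"))
print(normalize_date(None, None))
```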

MeshHeadingList goes into the table “thesaurus_t” (many-to-one relationship with “medline_t”)
ChemicalList goes into the table “chemical_list_t” (many-to-one relationship with “medline_t”)
Author fields go into “author_t” and “created_by_t”
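
The split into separate tables can be sketched as follows. The sample record is made up, the XML element names (`PMID`, `DescriptorName`, `NameOfSubstance`) are assumed from the general shape of MEDLINE citations, and the row layout of each table is hypothetical; only the table names come from the description above.

```python
import xml.etree.ElementTree as ET

# A made-up, heavily simplified citation used only for illustration.
SAMPLE = """
<MedlineCitation>
  <PMID>12345</PMID>
  <MeshHeadingList>
    <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Neoplasms</DescriptorName></MeshHeading>
  </MeshHeadingList>
  <ChemicalList>
    <Chemical><NameOfSubstance>Water</NameOfSubstance></Chemical>
  </ChemicalList>
</MedlineCitation>
"""


def split_citation(xml_text: str) -> dict:
    """Split one citation into a parent row and its child rows."""
    root = ET.fromstring(xml_text.strip())
    pmid = root.findtext("PMID")
    # Each child row carries the parent's identifier, which gives the
    # many-to-one relationship with the main table.
    mesh_rows = [
        (pmid, element.text)
        for element in root.findall("MeshHeadingList/MeshHeading/DescriptorName")
    ]
    chemical_rows = [
        (pmid, element.text)
        for element in root.findall("ChemicalList/Chemical/NameOfSubstance")
    ]
    return {
        "medline_t": [(pmid,)],
        "thesaurus_t": mesh_rows,
        "chemical_list_t": chemical_rows,
    }


rows = split_citation(SAMPLE)
for table, table_rows in rows.items():
    print(table, table_rows)
```

Each list of tuples could then be passed to a bulk insert for the corresponding table, with the citation identifier serving as the foreign key.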

Acknowledgements

This data set description was compiled by Ketan K. Mane, Weimao Ke, Katy Börner and Caroline Courtney.

Information Visualization CyberInfraStructure @ SLIS, Indiana University
Last Modified June 04, 2004