A data-code-compute resource for research and education in information visualization

Databases > Medline Data



Description

PubMed, a service of the National Library of Medicine, includes over 14 million citations for biomedical articles dating back to the 1950s. These citations come from MEDLINE and additional life science journals.


Origins

The MEDLINE baseline database was acquired via a License of NLM® Data.


Data Format

Raw Data:
The data element descriptions and the MEDLINE update chart are available from the Information for Licensees of NLM® Data along with other information.

Data Fields:
Name                        Null?     Type
ORIGINAL_MEDLINE_ID                   VARCHAR2(250)
PUB_MED_UNIQUE_IDENTIFIER             VARCHAR2(255)
DATE_CREATED                          DATE
DATE_COMPLETED                        DATE
INT_STANDARD_SERIAL_NUMBER            VARCHAR2(2550)
VOLUME                                VARCHAR2(2550)
ISSUE                                 VARCHAR2(2550)
AFFILIATION                           VARCHAR2(2000)
DATE_REVISED                          DATE
PAGE_NUMBER_START                     NUMBER
PAGE_NUMBER_END                       NUMBER
PUBLICATION_TYPE                      VARCHAR2(2000)
LANGUAGE                              VARCHAR2(2500)
COUNTRY                               VARCHAR2(2550)
DATA_IS_OK                            CHAR(1)
JOURNAL_TITLE_ABBREVIATION            VARCHAR2(1000)
NLM_UNIQUE_JOURNAL_ID                 VARCHAR2(2500)
CHEMICAL_LIST                         VARCHAR2(4000)
CITATION_SUBSET                       VARCHAR2(4000)
DATE_PUBLISHED                        DATE
TITLE                                 VARCHAR2(2550)
JOURNAL_NAME                          VARCHAR2(2550)
DATE_ENTERED                NOT NULL  TIMESTAMP(6)
COLLECTION                            VARCHAR2(100)
ABSTRACT                              CLOB
LAST_NAME                   NOT NULL  VARCHAR2(1000)
MIDDLE_NAME                           VARCHAR2(1000)
FIRST_NAME                            VARCHAR2(1000)
MESH_HEADING                          VARCHAR2(255)

Statistics:
Years covered: 1963-2002, total number of records: 11,693,477 (detailed statistics)

Storage Space Required:
392 files, each about 135 MB (gunzipped). The data will occupy about 60 GB on the Oracle server (20 GB for “medline_t” and 40 GB for other tables).
By August 2005, we expect to have about 15,000,000 MEDLINE records, which will need 59 GB for tables and indexes: 18 GB for "medline_table", 24 GB for "document_table", 5 GB for "created_by", and 12 GB for "author_table".
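The storage figures above can be cross-checked with a little arithmetic. The short Python sketch below uses only the numbers quoted in this section; the function names are illustrative, and decimal units (1 GB = 1000 MB) are an assumption, since the page does not say which convention it uses.

```python
# Figures quoted in the text above. Decimal units are an assumption.
FILE_COUNT = 392
FILE_SIZE_MB = 135  # approximate size of each gunzipped file

# Projected table and index sizes for about 15,000,000 records, in GB.
PROJECTED_GB = {
    "medline_table": 18,
    "document_table": 24,
    "created_by": 5,
    "author_table": 12,
}
PROJECTED_RECORDS = 15_000_000


def raw_size_gb(file_count: int, file_size_mb: float) -> float:
    """Total size of the gunzipped files in GB (1 GB = 1000 MB)."""
    return file_count * file_size_mb / 1000


def total_projected_gb(tables: dict) -> int:
    """Sum of the per-table projections."""
    return sum(tables.values())


def kb_per_record(total_gb: float, records: int) -> float:
    """Average storage per record in KB (1 GB = 1,000,000 KB)."""
    return total_gb * 1_000_000 / records


if __name__ == "__main__":
    print("Raw files (GB):", raw_size_gb(FILE_COUNT, FILE_SIZE_MB))
    print("Projected total (GB):", total_projected_gb(PROJECTED_GB))
    print("KB per record:", kb_per_record(59, PROJECTED_RECORDS))
```

The raw files come to roughly 53 GB, the four projected table sizes add up to the stated 59 GB, and the projection works out to a little under 4 KB per record.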


Data Quality

The original data is in XML format and rather clean. There are minor consistency issues in the dataset:

  • Some records have only “Year” and “Month” in the date fields (Date_Published, DateRevised).
  • Some records have a PageNumber string that cannot be split into Starting_Number and Ending_Number (e.g., “737-808-9-811-2”, “798-801; quiz 802-3 798”).
  • Some records have no value in the DateRevised field.
  • Some date fields use MonthName while others use MonthNumber.
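The page does not say how these cases are resolved, so the following is only a minimal Python sketch of one possible normalization. The function names are hypothetical. Three choices here are assumptions rather than documented behavior: a missing day (or month) defaults to the first, an end page shorter than the start page is expanded as an abbreviation, and page strings that cannot be parsed yield no start or end number.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Three-letter month prefixes mapped to month numbers.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# A simple "start-end" page range such as "798-801" or "802-3".
PAGE_RANGE = re.compile(r"^(\d+)-(\d+)$")


def month_to_number(value: str) -> Optional[int]:
    """Accept a month number ("03") or a month name ("Mar", "March")."""
    value = value.strip()
    if value.isdigit():
        month = int(value)
        return month if 1 <= month <= 12 else None
    return MONTHS.get(value[:3].lower())


def normalize_date(year: Optional[str],
                   month: Optional[str] = None,
                   day: Optional[str] = None) -> Optional[date]:
    """Build a date; return None when the year is missing or invalid."""
    if not year or not year.strip().isdigit():
        return None
    month_number = month_to_number(month) if month else 1  # assumption
    if month_number is None:
        return None
    day_number = int(day) if day and day.strip().isdigit() else 1  # assumption
    try:
        return date(int(year), month_number, day_number)
    except ValueError:
        return None


def split_pages(pagination: str) -> Tuple[Optional[int], Optional[int]]:
    """Split a page string into (start, end), or (None, None) if unclear."""
    text = pagination.strip()
    if text.isdigit():
        return int(text), int(text)
    match = PAGE_RANGE.match(text)
    if not match:
        return None, None
    start_text, end_text = match.groups()
    # Treat a shorter end page as an abbreviation: "802-3" means 802 to 803.
    if len(end_text) < len(start_text):
        end_text = start_text[: len(start_text) - len(end_text)] + end_text
    start, end = int(start_text), int(end_text)
    if end < start:
        return None, None
    return start, end
```

With this sketch, `split_pages("798-801")` gives a clean start and end, while the two problematic examples quoted above are flagged as unparseable instead of being guessed at.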


Data Cleaning

The dataset XML parser takes care of the inconsistency problems mentioned above before uploading the records into the database.

  • MeshHeadingList goes into the table “thesaurus_t” (many-to-one relationship with “medline_t”)
  • ChemicalList goes into the table “chemical_list_t” (many-to-one relationship with “medline_t”)
  • Author fields go into “author_t” and “created_by_t”
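The routing described above might look roughly like the following Python sketch. It is not the project's parser. The sample record is invented and heavily simplified, its element names are assumptions about the XML layout, and the row layouts (including how “author_t” and “created_by_t” divide the author data) are guesses made for illustration only.

```python
import xml.etree.ElementTree as ET

# Invented, simplified record; element names are assumptions.
SAMPLE = """
<MedlineCitation>
  <PMID>12345</PMID>
  <MeshHeadingList>
    <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Neoplasms</DescriptorName></MeshHeading>
  </MeshHeadingList>
  <ChemicalList>
    <Chemical><NameOfSubstance>Water</NameOfSubstance></Chemical>
  </ChemicalList>
  <AuthorList>
    <Author><LastName>Doe</LastName><FirstName>Jane</FirstName></Author>
  </AuthorList>
</MedlineCitation>
"""


def route_record(xml_text: str) -> dict:
    """Split one citation into rows grouped by destination table."""
    record = ET.fromstring(xml_text.strip())
    pmid = record.findtext("PMID")
    rows = {
        "medline_t": [{"pub_med_unique_identifier": pmid}],
        "thesaurus_t": [],
        "chemical_list_t": [],
        "author_t": [],
        "created_by_t": [],
    }
    # One thesaurus_t row per MeSH heading, each pointing at the citation.
    for heading in record.iterfind("MeshHeadingList/MeshHeading/DescriptorName"):
        rows["thesaurus_t"].append({"pmid": pmid, "mesh_heading": heading.text})
    # One chemical_list_t row per chemical, each pointing at the citation.
    for chemical in record.iterfind("ChemicalList/Chemical/NameOfSubstance"):
        rows["chemical_list_t"].append({"pmid": pmid, "chemical": chemical.text})
    # Assumed split: author_t holds names, created_by_t links them to citations.
    for position, author in enumerate(record.iterfind("AuthorList/Author"), start=1):
        last_name = author.findtext("LastName")
        rows["author_t"].append({
            "last_name": last_name,
            "middle_name": author.findtext("MiddleName"),
            "first_name": author.findtext("FirstName"),
        })
        rows["created_by_t"].append({
            "pmid": pmid,
            "last_name": last_name,
            "author_position": position,
        })
    return rows


if __name__ == "__main__":
    for table, table_rows in route_record(SAMPLE).items():
        print(table, table_rows)
```

Keeping the headings and chemicals in their own tables, keyed by the citation identifier, is what gives the many-to-one relationship with “medline_t” mentioned in the list.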

Acknowledgements

This data set description was compiled by Ketan K. Mane, Weimao Ke, Katy Börner and Caroline Courtney.

Information Visualization CyberInfraStructure @ SLIS, Indiana University
Last Modified June 04, 2004