Medline Data

Description
PubMed, a service of the National Library of Medicine, includes over
14 million citations for biomedical articles dating back to the 1950s.
These citations are from MEDLINE and additional life science journals.

Origins
The MEDLINE baseline database was acquired via a License
of NLM® Data.

Data Format
Raw Data:
The data element descriptions and the MEDLINE update chart are available,
along with other information, from the "Information for Licensees of
NLM® Data" page.
Data Fields:
Name                        Null?    Type
ORIGINAL_MEDLINE_ID                  VARCHAR2(250)
PUB_MED_UNIQUE_IDENTIFIER            VARCHAR2(255)
DATE_CREATED                         DATE
DATE_COMPLETED                       DATE
INT_STANDARD_SERIAL_NUMBER           VARCHAR2(2550)
VOLUME                               VARCHAR2(2550)
ISSUE                                VARCHAR2(2550)
AFFILIATION                          VARCHAR2(2000)
DATE_REVISED                         DATE
PAGE_NUMBER_START                    NUMBER
PAGE_NUMBER_END                      NUMBER
PUBLICATION_TYPE                     VARCHAR2(2000)
LANGUAGE                             VARCHAR2(2500)
COUNTRY                              VARCHAR2(2550)
DATA_IS_OK                           CHAR(1)
JOURNAL_TITLE_ABBREVIATION           VARCHAR2(1000)
NLM_UNIQUE_JOURNAL_ID                VARCHAR2(2500)
CHEMICAL_LIST                        VARCHAR2(4000)
CITATION_SUBSET                      VARCHAR2(4000)
DATE_PUBLISHED                       DATE
TITLE                                VARCHAR2(2550)
JOURNAL_NAME                         VARCHAR2(2550)
DATE_ENTERED                NOT NULL TIMESTAMP(6)
COLLECTION                           VARCHAR2(100)
ABSTRACT                             CLOB
LAST_NAME                   NOT NULL VARCHAR2(1000)
MIDDLE_NAME                          VARCHAR2(1000)
FIRST_NAME                           VARCHAR2(1000)
MESH_HEADING                         VARCHAR2(255)
Statistics:
Years covered: 1963-2002, total number of records: 11,693,477 (detailed
statistics)
Storage Space Required:
392 files, each ~135 MB (gunzipped). The data will occupy about 60 GB
on the Oracle server (20 GB for “medline_t” and 40 GB for the other
tables).
By August 2005, we expect to have about 15,000,000 MEDLINE records, which
will need 59 GB for tables and indexes: 18 GB for "medline_table", 24 GB
for "document_table", 5 GB for "created_by", and 12 GB for
"author_table".
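As a rough cross-check of the figures above, the arithmetic can be sketched in a few lines of Python. The file count, per-file size, and per-table estimates are taken from the text; treating 1 GB as 1000 MB is an assumption made only for this estimate.

```python
# Back-of-the-envelope check of the storage figures quoted above.
RAW_FILES = 392        # number of uncompressed files (from the text)
MB_PER_FILE = 135      # approximate size of each file in MB (from the text)

# Assumption: decimal units, 1 GB = 1000 MB.
raw_total_gb = RAW_FILES * MB_PER_FILE / 1000

# Projected space per table for August 2005, in GB (from the text).
projected_gb = {
    "medline_table": 18,
    "document_table": 24,
    "created_by": 5,
    "author_table": 12,
}
projected_total_gb = sum(projected_gb.values())

print(f"Raw files: about {raw_total_gb:.1f} GB")
print(f"Projected tables and indexes: {projected_total_gb} GB")
```

The raw files come to roughly 53 GB, which is in line with the stated 60 GB on the server, and the four per-table estimates add up to the stated 59 GB.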

Data Quality
The original data is in XML format and is rather clean. There are
minor consistency issues in the dataset:
- Some records have only “Year” and “Month”
in the date fields (Date_Published, DateRevised).
- Some records have a PageNumber string that cannot be divided into Starting_Number
and Ending_Number (e.g., “737-808-9-811-2” or “798-801;
quiz 802-3 798”).
- Some records do not have a DateRevised field value.
- Some date fields (records) use MonthName while others use MonthNumber.
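The issues listed above can be illustrated with a minimal Python sketch. This is not the parser used for the dataset. The function names, the choice to default a missing month or day to 1, the handling of single pages, and the expansion of abbreviated end pages (reading "802-3" as 802 to 803) are all assumptions made for illustration.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Map three-letter month prefixes to month numbers.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def month_to_number(month: str) -> Optional[int]:
    """Accept '3', '03', 'Mar' or 'March'; return 1..12, or None if unknown."""
    text = month.strip().lower()
    if text.isdigit():
        value = int(text)
        return value if 1 <= value <= 12 else None
    return MONTHS.get(text[:3])


def normalize_date(year: str, month: Optional[str] = None,
                   day: Optional[str] = None) -> Optional[str]:
    """Return an ISO date string, or None if the parts cannot be interpreted.

    Assumption: a missing month or day defaults to 1.
    """
    if not year or not year.strip().isdigit():
        return None
    month_number = month_to_number(month) if month else 1
    if month_number is None:
        return None
    day_number = int(day) if day and day.strip().isdigit() else 1
    try:
        return date(int(year), month_number, day_number).isoformat()
    except ValueError:  # e.g. day 31 in a 30-day month
        return None


# A simple "start" or "start-end" pattern; anything else is left unsplit.
PAGE_RANGE = re.compile(r"^(\d+)(?:-(\d+))?$")


def split_pages(pages: str) -> Tuple[Optional[int], Optional[int]]:
    """Split a page string into (start, end); (None, None) if not splittable."""
    match = PAGE_RANGE.match(pages.strip())
    if match is None:
        return (None, None)
    start_text, end_text = match.group(1), match.group(2)
    start = int(start_text)
    if end_text is None:
        return (start, start)  # assumption: a single page starts and ends there
    if len(end_text) < len(start_text):
        # Assumption: a shorter end page is an abbreviation of the start page.
        end_text = start_text[: -len(end_text)] + end_text
    return (start, int(end_text))


print(normalize_date("2002", "Mar"))      # month name, no day
print(normalize_date("2002", "3", "15"))  # month number
print(split_pages("798-801"))
print(split_pages("737-808-9-811-2"))     # cannot be split
```

Strings that do not match the simple range pattern are returned as `(None, None)` rather than guessed at, so they can be flagged for manual review.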

Data Cleaning
The dataset's XML parser resolves the inconsistencies listed above
before uploading the records into the database. Repeating elements are
stored in separate tables:
- MeshHeadingList goes into the table “thesaurus_t”
(many-to-one relationship with “medline_t”)
- ChemicalList goes into the table “chemical_list_t”
(many-to-one relationship with “medline_t”)
- Author fields go into “author_t” and “created_by_t”
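The many-to-one layout described above can be sketched with Python's built-in `sqlite3` module. The actual database is Oracle, and the source does not give the key columns, so the column names, the use of the PubMed identifier as the join key, and the placeholder values below are all hypothetical. The author tables are left out because their structure is not described.

```python
import sqlite3

# In-memory database used only to illustrate the table relationships.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per citation (column names are hypothetical).
    CREATE TABLE medline_t (
        pub_med_unique_identifier TEXT PRIMARY KEY,
        title TEXT
    );
    -- Many MeSH headings can point at one citation.
    CREATE TABLE thesaurus_t (
        pub_med_unique_identifier TEXT NOT NULL
            REFERENCES medline_t (pub_med_unique_identifier),
        mesh_heading TEXT NOT NULL
    );
    -- Many chemicals can point at one citation.
    CREATE TABLE chemical_list_t (
        pub_med_unique_identifier TEXT NOT NULL
            REFERENCES medline_t (pub_med_unique_identifier),
        chemical TEXT NOT NULL
    );
    """
)

# Placeholder values, not real MEDLINE data.
conn.execute("INSERT INTO medline_t VALUES (?, ?)", ("ID-1", "Example title"))
conn.executemany(
    "INSERT INTO thesaurus_t VALUES (?, ?)",
    [("ID-1", "Heading A"), ("ID-1", "Heading B")],
)
conn.execute("INSERT INTO chemical_list_t VALUES (?, ?)", ("ID-1", "Chemical X"))

# Join back from the heading table to the citation it belongs to.
heading_rows = conn.execute(
    """
    SELECT m.title, t.mesh_heading
    FROM medline_t AS m
    JOIN thesaurus_t AS t
      ON t.pub_med_unique_identifier = m.pub_med_unique_identifier
    ORDER BY t.mesh_heading
    """
).fetchall()
heading_count = len(heading_rows)
print(heading_rows)
```

Keeping the repeating elements in their own tables avoids packing several headings or chemicals into a single text column of the main table.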

Acknowledgements
This data set description was compiled by Ketan K. Mane, Weimao Ke,
Katy Börner, and Caroline Courtney.

Information Visualization CyberInfraStructure
@ SLIS, Indiana University
Last Modified June 04, 2004