Databases > Medline Data
Description
PubMed, a service of the National Library of Medicine, includes over
14 million citations for biomedical articles dating back to the 1950s.
These citations come from MEDLINE and additional life science journals.
Origins
The MEDLINE baseline database was acquired via a License
of NLM® Data.
Data Format
Raw Data:
The data element descriptions and the MEDLINE update chart
are available from the Information
for Licensees of NLM® Data along with other information.
Data Fields:
ORIGINAL_MEDLINE_ID                   VARCHAR2(250)
PUB_MED_UNIQUE_IDENTIFIER             VARCHAR2(255)
DATE_CREATED                          DATE
DATE_COMPLETED                        DATE
INT_STANDARD_SERIAL_NUMBER            VARCHAR2(2550)
VOLUME                                VARCHAR2(2550)
ISSUE                                 VARCHAR2(2550)
AFFILIATION                           VARCHAR2(2000)
DATE_REVISED                          DATE
PAGE_NUMBER_START                     NUMBER
PAGE_NUMBER_END                       NUMBER
PUBLICATION_TYPE                      VARCHAR2(2000)
LANGUAGE                              VARCHAR2(2500)
COUNTRY                               VARCHAR2(2550)
DATA_IS_OK                            CHAR(1)
JOURNAL_TITLE_ABBREVIATION            VARCHAR2(1000)
NLM_UNIQUE_JOURNAL_ID                 VARCHAR2(2500)
CHEMICAL_LIST                         VARCHAR2(4000)
CITATION_SUBSET                       VARCHAR2(4000)
DATE_PUBLISHED                        DATE
TITLE                                 VARCHAR2(2550)
JOURNAL_NAME                          VARCHAR2(2550)
DATE_ENTERED                NOT NULL  TIMESTAMP(6)
COLLECTION                            VARCHAR2(100)
ABSTRACT                              CLOB
LAST_NAME                   NOT NULL  VARCHAR2(1000)
MIDDLE_NAME                           VARCHAR2(1000)
FIRST_NAME                            VARCHAR2(1000)
MESH_HEADING                          VARCHAR2(255)
Statistics:
Years covered: 1963-2002, total number of records: 11,693,477 (detailed
statistics)
Storage Space Required:
392 files, each ~135 MB (gunzipped). The data will occupy about 60 GB
on the Oracle server (20 GB for “medline_t” and 40 GB for other tables).
By August 2005, we will probably have 15,000,000 MEDLINE records, which
will need 59 GB for tables and indexes: 18 GB for "medline_table", 24 GB
for "document_table", 5 GB for "created_by", and 12 GB for
"author_table".
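The storage figures above can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers quoted in this section; treating "MB" and "GB" as decimal units and deriving a per-record average are assumptions made for illustration, not measurements.

```python
# Figures copied from the text above; nothing here is measured.
FILES = 392
MB_PER_FILE = 135  # approximate size of one gunzipped file

# Raw XML volume, assuming decimal units (1 GB = 1000 MB).
raw_gb = FILES * MB_PER_FILE / 1000

# Projected table and index sizes for August 2005, in GB.
projected_gb = {
    "medline_table": 18,
    "document_table": 24,
    "created_by": 5,
    "author_table": 12,
}
total_projected_gb = sum(projected_gb.values())

# Average space per record implied by the projection (assumption: 1 GB = 10**9 bytes).
PROJECTED_RECORDS = 15_000_000
bytes_per_record = total_projected_gb * 10**9 / PROJECTED_RECORDS

print(f"raw XML: about {raw_gb:.1f} GB")
print(f"projected total: {total_projected_gb} GB")
print(f"about {bytes_per_record:.0f} bytes per record")
```

The four component sizes add up to the quoted 59 GB, and the raw files come to roughly 53 GB, which is in line with the estimate of about 60 GB on the server.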
Data Quality
The original data is in XML format and rather clean. There are
minor consistency issues in the dataset:
- Some records only have “Year” and “Month”
in the date fields (Date_Published, DateRevised).
- Some records have a PageNumber string that cannot be divided into
Starting_Number and Ending_Number (e.g., “737-808-9-811-2”,
“798-801; quiz 802-3 798”).
- Some records do not have a DateRevised field value.
- Some date fields (records) use MonthName while others use MonthNumber.
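The issues listed above can be handled with small normalization helpers. The following is a minimal sketch, not the parser actually used for this dataset: the function names are hypothetical, defaulting a missing day to the first of the month is an assumption, and returning empty values for unsplittable page strings is only one possible policy.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Month names are matched on their first three letters (lenient by design).
MONTHS = {name: i + 1 for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

# Accept only a plain "start-end" pattern of digits.
PAGE_RANGE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*$")


def month_to_number(value: str) -> Optional[int]:
    """Accept '7', '07', 'Jul' or 'July'; return 1-12, or None if unknown."""
    text = value.strip().lower()
    if text.isdigit():
        number = int(text)
        return number if 1 <= number <= 12 else None
    return MONTHS.get(text[:3])


def normalize_date(year: str, month: str,
                   day: Optional[str] = None) -> Optional[date]:
    """Build a date; a missing day becomes 1 (an assumption of this sketch)."""
    month_number = month_to_number(month)
    if month_number is None:
        return None
    try:
        day_number = int(day) if day and day.strip() else 1
        return date(int(year), month_number, day_number)
    except ValueError:
        return None  # non-numeric year or day, or an impossible date


def split_pages(pagination: str) -> Tuple[Optional[int], Optional[int]]:
    """Split 'start-end' into two integers; (None, None) if it does not fit."""
    match = PAGE_RANGE.match(pagination)
    if match is None:
        return (None, None)
    return (int(match.group(1)), int(match.group(2)))
```

With these helpers, `normalize_date("1998", "Jul")` and `normalize_date("1998", "07")` yield the same date, while the two problematic page strings quoted above are left without start and end numbers instead of being guessed at.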
Data Cleaning
The dataset XML parser takes care of the inconsistency problems
mentioned above before uploading the records into the database.
- MeshHeadingList goes into the table “thesaurus_t”
(many-to-one relationship with “medline_t”)
- ChemicalList goes into the table “chemical_list_t”
(many-to-one relationship with “medline_t”)
- Author fields go into “author_t” and “created_by_t”
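The routing of list-valued fields into separate tables can be sketched as follows. This is an illustration under stated assumptions: the dictionary keys, the row layouts, and the sample record are hypothetical, only the table names come from the text above, and author handling is left out because the layout of “author_t” and “created_by_t” is not described here.

```python
from typing import Dict, List, Tuple


def split_citation(citation: dict) -> Dict[str, List[Tuple]]:
    """Turn one parsed citation into rows for the parent and child tables.

    Each child row carries the citation's PubMed ID, so many child rows
    point back to one row in medline_t.
    """
    pmid = citation["pmid"]  # hypothetical key name
    return {
        "medline_t": [(pmid, citation.get("title"))],
        "thesaurus_t": [
            (pmid, heading) for heading in citation.get("mesh_headings", [])
        ],
        "chemical_list_t": [
            (pmid, chemical) for chemical in citation.get("chemicals", [])
        ],
    }


# Made-up sample record, for illustration only.
sample = {
    "pmid": "0000001",
    "title": "An example title",
    "mesh_headings": ["Humans", "Animals"],
    "chemicals": ["Water"],
}
rows = split_citation(sample)
print(len(rows["thesaurus_t"]), "MeSH rows for one citation")
```

Keeping the PubMed ID in every child row is what lets the headings and chemicals be joined back to their citation after upload.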
Acknowledgements
This data set description was compiled by Ketan K. Mane, Weimao Ke,
Katy Börner and Caroline Courtney.
Information Visualization CyberInfraStructure
@ SLIS, Indiana University
Last Modified June 04, 2004