A data-code-compute resource for research and education in information visualization

Proc. of the National Academy of Sciences (PNAS) Dataset



Description

The data set comprises full-text documents from the Proceedings of the National Academy of Sciences (PNAS) covering 01-07-1997 to 09-17-2002 (148 issues containing some 93,000 journal pages).

The ARTICLE and CITATION tables were obtained by parsing original SGML files from PNAS. The AUTHOR (with author order), MeSH (Medical Subject Headings) terms, and Medline UI (unique identifier) tables were added by joining Medline records with the original PNAS-supplied records for the same articles using the UNIQUE_KEY.
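The join described above can be sketched as follows. This is only a minimal illustration, not the procedure actually used to build the data set: the table names `pnas_article` and `medline_record`, the column names `medline_ui` and `mesh_term`, and all sample rows are hypothetical. Only the idea of matching on UNIQUE_KEY comes from this page. The sketch uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3

# In-memory database. Table names, column names other than unique_key,
# and all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pnas_article   (unique_key TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE medline_record (unique_key TEXT, medline_ui TEXT, mesh_term TEXT);
""")
conn.executemany(
    "INSERT INTO pnas_article VALUES (?, ?)",
    [("k1", "Example article A"), ("k2", "Example article B")],
)
conn.executemany(
    "INSERT INTO medline_record VALUES (?, ?, ?)",
    [("k1", "ui-001", "Term X"), ("k1", "ui-001", "Term Y")],
)

# LEFT JOIN keeps PNAS articles that have no matching Medline record;
# their Medline columns come back as NULL (None in Python).
rows = conn.execute("""
    SELECT a.unique_key, a.title, m.medline_ui, m.mesh_term
    FROM pnas_article AS a
    LEFT JOIN medline_record AS m ON m.unique_key = a.unique_key
    ORDER BY a.unique_key, m.mesh_term
""").fetchall()

for row in rows:
    print(row)
```

A left join is used here on the assumption that some PNAS articles may have no Medline match; an inner join would silently drop those articles.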

The data is also available in Microsoft Access 97 format.


Origins

The data set was provided by the PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES USA (PNAS) for the Arthur M. Sackler Colloquium, Mapping Knowledge Domains, held May 9-11, 2003.

It is available for research and educational purposes to anybody registered for the Sackler Colloquium on Mapping Knowledge Domains who has signed the copyright form. It cannot be redistributed without prior permission from PNAS, and it cannot be used for commercial purposes.


Data Format

Raw Data:
The data is available as a relational database in MS Access *.mdb format. The relational diagram is available here.

Data Fields:
PNAS_ID              NOT NULL  NUMBER
ISSUE_NUMBER                   NUMBER
VOLUME_ID                      NUMBER
PAGE_NUMBER_START              NUMBER
PAGE_NUMBER_END                NUMBER
ARTICLE_ID                     NUMBER
ABSTRACT                       CLOB
LAST_NAME            NOT NULL  VARCHAR2(1000)
MIDDLE_NAME                    VARCHAR2(1000)
FIRST_NAME                     VARCHAR2(1000)
AUTHOR_ORDER                   NUMBER
CITATION_TITLE                 VARCHAR2(2550)
CITATION_YEAR                  NUMBER
CITATION_VOL                   NUMBER
CITATION_PAGE                  NUMBER
CITED_LAST_NAME                VARCHAR2(1000)
CITED_MIDDLE_NAME              VARCHAR2(1000)
CITED_FIRST_NAME               VARCHAR2(1000)
MESH_TERMS                     VARCHAR(255)
FULL_TEXT                      CLOB
AFFILIATION                    VARCHAR(2000)
TITLE                          VARCHAR2(2550)

DATA_IS_OK                     CHAR(1)

Statistics:
Years covered: 01-07-1997 to 09-17-2002. Number of records: 16,169.
Total number of authors: 80,856
Number of unique authors: 54,074
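The total author count exceeds the unique count because one person can appear on several articles. The following minimal sketch shows the distinction, assuming an author is identified by the triple (LAST_NAME, FIRST_NAME, MIDDLE_NAME). This page does not state how uniqueness was actually determined, and the helper `author_key` and the sample rows are invented for illustration.

```python
# Each tuple is one authorship row: (LAST_NAME, FIRST_NAME, MIDDLE_NAME).
# Sample rows are invented; they are not taken from the data set.
authorships = [
    ("Smith", "Jane", "A"),
    ("Smith", "Jane", "A"),   # same person on a second article
    ("Smith", "John", None),
    ("Lee", "Min", None),
]

def author_key(last, first, middle):
    """Normalize case and whitespace so trivially different spellings match."""
    return tuple((part or "").strip().lower() for part in (last, first, middle))

# Every authorship row counts toward the total; the set collapses repeats.
total_authors = len(authorships)
unique_authors = len({author_key(*row) for row in authorships})

print(total_authors, unique_authors)
```

Name-based matching can merge distinct people who share a name and split one person whose name is recorded in different ways, so a count obtained this way is approximate.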

Year     # of Citations    # of Records
1997             92,652           2,722
1998             92,263           2,879
1999            125,793           2,830
2000            177,935           2,701
2001            188,916           2,814
2002            149,270           2,223
Total           826,829          16,169
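As a quick consistency check, the per-year figures in the table above can be summed and compared with the reported totals. The numbers are copied from this page; the variable names are arbitrary.

```python
# Per-year figures copied from the table above: year -> (citations, records).
per_year = {
    1997: (92_652, 2_722),
    1998: (92_263, 2_879),
    1999: (125_793, 2_830),
    2000: (177_935, 2_701),
    2001: (188_916, 2_814),
    2002: (149_270, 2_223),
}
reported_total = (826_829, 16_169)

citations = sum(c for c, _ in per_year.values())
records = sum(r for _, r in per_year.values())

# Fails loudly if the yearly rows and the Total row disagree.
assert (citations, records) == reported_total
print(citations, records)
```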

Storage Space Required:
Number of Entries: 16,169 in six files totaling 583 MB.


Data Quality

Year     Articles    Affiliation    AID    Abstract    Full-text
1997           51             51    226         145            0
1998           72             67    138         172           16
1999           76             78     92         221           16
2000           57             62     87         197           18
2001           73             77     75         242            0
2002           16             15      3         106            0
Total         345            350    621       1,083           50


Data Cleaning

The original SGML files provided by PNAS were parsed, and author names and MeSH terms were added.


Acknowledgements

We would like to thank Kevin W. Boyack (Sandia National Laboratories) and Jason Baumgartner (Indiana University) for the effort they spent parsing the SGML files provided by PNAS, adding author names and MeSH terms, and making the data available in an easy-to-use format.
This description was prepared by Ketan K. Mane, Weimao Ke, Katy Börner and Caroline Courtney.

Information Visualization CyberInfraStructure @ SLIS, Indiana University
Last Modified June 04, 2004