Proc. of the National Academy of Sciences (PNAS) Dataset
Description
The data set comprises full-text documents from the
Proceedings of the National Academy of
Sciences covering January 7, 1997 to September 17, 2002 (148 issues containing
some 93,000 journal pages).
The ARTICLE and CITATION tables were obtained by
parsing original SGML files from PNAS. The AUTHOR (with author order),
MeSH (Medical Subject Headings) terms, and Medline UI (unique identifier)
tables were added by joining Medline records with the original PNAS-supplied
records for the same articles using the UNIQUE_KEY.
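The join described above can be illustrated with a minimal sketch. It assumes both sources carry a UNIQUE_KEY value per article; the sample rows, the MEDLINE_UI column name, and the choice of a left join (keeping PNAS articles that have no Medline match) are illustrative assumptions, not details confirmed by the data set documentation.

```python
# Hypothetical sample rows; the real tables hold many more columns and records.
pnas_articles = [
    {"UNIQUE_KEY": 1, "TITLE": "Example article A"},
    {"UNIQUE_KEY": 2, "TITLE": "Example article B"},
]
medline_records = [
    {"UNIQUE_KEY": 1, "MEDLINE_UI": "UI-0001", "MESH_TERMS": ["Term X", "Term Y"]},
    {"UNIQUE_KEY": 3, "MEDLINE_UI": "UI-0003", "MESH_TERMS": ["Term Z"]},
]


def join_on_unique_key(articles, medline):
    """Left-join Medline fields onto PNAS article rows by UNIQUE_KEY."""
    # Index the Medline rows once so each lookup is a dictionary access.
    medline_by_key = {row["UNIQUE_KEY"]: row for row in medline}
    joined = []
    for article in articles:
        match = medline_by_key.get(article["UNIQUE_KEY"])
        merged = dict(article)
        # Articles without a Medline match keep empty Medline fields (assumption).
        merged["MEDLINE_UI"] = match["MEDLINE_UI"] if match else None
        merged["MESH_TERMS"] = list(match["MESH_TERMS"]) if match else []
        joined.append(merged)
    return joined


joined_rows = join_on_unique_key(pnas_articles, medline_records)
```

Indexing one side by key keeps the join linear in the number of rows, which matters for a collection of this size.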
The data is also available in Microsoft Access 97
format.
Origins
The data set was provided by the PROCEEDINGS OF THE NATIONAL
ACADEMY OF SCIENCES USA (PNAS) for the Arthur M. Sackler Colloquium, Mapping
Knowledge Domains, held May 9-11, 2003.
It is available for research and educational purposes to anyone registered
for the Sackler Colloquium on Mapping Knowledge Domains who signed the
copyright form.
It cannot be redistributed without prior permission from PNAS. It cannot
be used for commercial purposes.
Data Format
Raw Data:
The data is available in MS Access (*.mdb) format as a relational
database. The relational diagram is available here.
Data Fields:
PNAS_ID NOT NULL NUMBER
ISSUE_NUMBER NUMBER
VOLUME_ID NUMBER
PAGE_NUMBER_START NUMBER
PAGE_NUMBER_END NUMBER
ARTICLE_ID NUMBER
ABSTRACT CLOB
LAST_NAME NOT NULL VARCHAR2(1000)
MIDDLE_NAME VARCHAR2(1000)
FIRST_NAME VARCHAR2(1000)
AUTHOR_ORDER NUMBER
CITATION_TITLE VARCHAR2(2550)
CITATION_YEAR NUMBER
CITATION_VOL NUMBER
CITATION_PAGE NUMBER
CITED_LAST_NAME VARCHAR2(1000)
CITED_MIDDLE_NAME VARCHAR2(1000)
CITED_FIRST_NAME VARCHAR2(1000)
MESH_TERMS VARCHAR(255)
FULL_TEXT CLOB
AFFILIATION VARCHAR(2000)
TITLE VARCHAR2(2550)
DATA_IS_OK CHAR(1)
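The field listing above follows a regular pattern: a field name, an optional NOT NULL constraint, and a type with an optional size. As a minimal sketch, assuming every line follows that pattern, a few of the listed lines can be parsed into structured records. The names `FIELD_PATTERN` and `parse_field` are hypothetical helpers introduced here for illustration.

```python
import re

# A few lines copied from the field listing; the full listing parses the same way
# as long as each field name is a single token.
FIELD_LISTING = """\
PNAS_ID NOT NULL NUMBER
ABSTRACT CLOB
LAST_NAME NOT NULL VARCHAR2(1000)
MESH_TERMS VARCHAR(255)
DATA_IS_OK CHAR(1)
"""

# name, optional NOT NULL, type, optional (size)
FIELD_PATTERN = re.compile(
    r"^(?P<name>\w+)\s+(?P<not_null>NOT NULL\s+)?"
    r"(?P<type>[A-Z0-9]+)(?:\((?P<size>\d+)\))?$"
)


def parse_field(line):
    """Turn one listing line into a dict describing the field."""
    match = FIELD_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized field line: {line!r}")
    size = match.group("size")
    return {
        "name": match.group("name"),
        "nullable": match.group("not_null") is None,
        "type": match.group("type"),
        "size": int(size) if size else None,
    }


fields = [parse_field(line) for line in FIELD_LISTING.splitlines() if line.strip()]
```

A structure like this could serve as the starting point for generating table definitions in another database system, although the listing itself does not say which fields belong to which table.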
Statistics:
Years covered: 01-07-1997 to 09-17-2002. Number of records: 16,169.
Total number of authors: 80,856
Number of unique authors: 54,074
Year | # of Citations | # of Records
1997 | 92,652 | 2,722
1998 | 92,263 | 2,879
1999 | 125,793 | 2,830
2000 | 177,935 | 2,701
2001 | 188,916 | 2,814
2002 | 149,270 | 2,223
Total | 826,829 | 16,169
Storage Space Required:
Number of Entries: 16,169 in six files with a total of 583 MB.
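The yearly statistics can be cross-checked with a few lines of arithmetic. The sketch below copies the per-year citation and record counts and the author total from this section, recomputes the totals, and derives two ratios (citations per record and authors per record) that the page itself does not report. The variable names are illustrative.

```python
# (citations, records) per year, copied from the statistics in this section.
yearly_stats = {
    1997: (92_652, 2_722),
    1998: (92_263, 2_879),
    1999: (125_793, 2_830),
    2000: (177_935, 2_701),
    2001: (188_916, 2_814),
    2002: (149_270, 2_223),
}
TOTAL_AUTHORS = 80_856  # "Total number of authors" as listed above

total_citations = sum(citations for citations, _ in yearly_stats.values())
total_records = sum(records for _, records in yearly_stats.values())

# Derived ratios; these are computed here and do not appear in the source.
citations_per_record = {
    year: citations / records for year, (citations, records) in yearly_stats.items()
}
authors_per_record = TOTAL_AUTHORS / total_records

for year, ratio in citations_per_record.items():
    print(f"{year}: {ratio:.1f} citations per record")
print(f"Totals: {total_citations:,} citations, {total_records:,} records")
```

The recomputed sums agree with the reported totals of 826,829 citations and 16,169 records.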
Data Quality
Year | Articles | Affiliation | AID | Abstract | Full-text
1997 | 51 | 51 | 226 | 145 | 0
1998 | 72 | 67 | 138 | 172 | 16
1999 | 76 | 78 | 92 | 221 | 16
2000 | 57 | 62 | 87 | 197 | 18
2001 | 73 | 77 | 75 | 242 | 0
2002 | 16 | 15 | 3 | 106 | 0
Total | 345 | 350 | 621 | 1083 | 50
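The Total row of the data quality table can be verified by summing each column. The sketch below only checks the arithmetic. It makes no assumption about what the counts represent, since the page does not define the columns.

```python
COLUMNS = ("Articles", "Affiliation", "AID", "Abstract", "Full-text")

# Per-year counts copied from the data quality table, in column order.
quality_rows = {
    1997: (51, 51, 226, 145, 0),
    1998: (72, 67, 138, 172, 16),
    1999: (76, 78, 92, 221, 16),
    2000: (57, 62, 87, 197, 18),
    2001: (73, 77, 75, 242, 0),
    2002: (16, 15, 3, 106, 0),
}
reported_totals = (345, 350, 621, 1083, 50)

# zip(*rows) transposes the year rows into one tuple per column.
computed_totals = tuple(sum(column) for column in zip(*quality_rows.values()))

# Collect any column whose computed sum differs from the reported total.
mismatches = {
    name: (computed, reported)
    for name, computed, reported in zip(COLUMNS, computed_totals, reported_totals)
    if computed != reported
}
```

With the values as listed, every column sum matches its reported total, so `mismatches` is empty.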
Data Cleaning
The original SGML files provided by PNAS were parsed, and author names and
MeSH terms were added.
Acknowledgements
We would like to thank Kevin W. Boyack (Sandia National Laboratories)
and Jason Baumgartner (Indiana University) for the effort they spent
parsing the SGML files provided by PNAS, adding author names and MeSH
terms, and making the data available in an easy-to-use format.
This description was prepared by Ketan K. Mane, Weimao Ke, Katy Börner,
and Caroline Courtney.
Information Visualization CyberInfraStructure @ SLIS, Indiana University
Last Modified June 04, 2004