National Institutes of Health (NIH) Grant Award Data
Description
The National Institutes of Health (NIH), which comprises 27 institutes
and centers, is an agency of the U.S. Department of Health and Human
Services. It awards grants for the support of basic or clinical
biomedical, behavioral, and bioengineering research.
Origins
CRISP (Computer Retrieval of Information on Scientific Projects) is a
searchable database of federally funded biomedical research projects
conducted at universities, hospitals, and other research institutions.
The database, maintained by the Office of Extramural Research at the
National Institutes of Health, includes projects funded by the National
Institutes of Health (NIH), the Substance Abuse and Mental Health
Services Administration (SAMHSA), the Health Resources and Services
Administration (HRSA), the Food and Drug Administration (FDA), the
Centers for Disease Control and Prevention (CDCP), the Agency for
Healthcare Research and Quality (AHRQ), and the Office of the Assistant
Secretary for Health (OASH).
Information on NIH Award amounts is available at the Award
Data web site.
Data Format
Raw Data:
To get familiar with this data set, query the NIH awards database via
CRISP.
Data Fields:
- Grant Number: number
- PI First Name: varchar2(1000)
- PI Middle Name: varchar2(1000)
- PI Last Name: varchar2(1000)
- PI Email: varchar2(1000)
- PI Title: varchar2(4000)
- Project Title: varchar2(2550)
- Abstract: clob
- Thesaurus Terms: varchar2(4000)
- Institution Name: varchar2(2000)
- Institution Address: varchar2(2000)
- Institution City: varchar2(2000)
- Institution State: varchar2(500)
- Institution Zipcode1: number
- Institution Zipcode2: number
- Institution Country: varchar2(500)
- Fiscal Year: date
- Department: varchar2(400)
- Project Start: date
- Project End: date
- Institutes Centers Divisions (ICD): varchar2(400)
- Integrated Review Group (IRG): varchar2(4000)
- Amount: number
- Keywords: varchar2(255)
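For loading into a relational database, the field list above can be written down as a table definition. The following is a minimal sketch, not the project's actual schema: the table name `nih_awards` and the underscore-style column names are assumptions (only `Grant_Number`, `PI_Email`, and `PI_Title` appear in that form elsewhere in this description). The types are copied from the list.

```python
# Field names and types as listed above. Column names are ASSUMED to follow
# the Grant_Number / PI_Email style used elsewhere in this description.
FIELDS = [
    ("Grant_Number", "number"),
    ("PI_First_Name", "varchar2(1000)"),
    ("PI_Middle_Name", "varchar2(1000)"),
    ("PI_Last_Name", "varchar2(1000)"),
    ("PI_Email", "varchar2(1000)"),
    ("PI_Title", "varchar2(4000)"),
    ("Project_Title", "varchar2(2550)"),
    ("Abstract", "clob"),
    ("Thesaurus_Terms", "varchar2(4000)"),
    ("Institution_Name", "varchar2(2000)"),
    ("Institution_Address", "varchar2(2000)"),
    ("Institution_City", "varchar2(2000)"),
    ("Institution_State", "varchar2(500)"),
    ("Institution_Zipcode1", "number"),
    ("Institution_Zipcode2", "number"),
    ("Institution_Country", "varchar2(500)"),
    ("Fiscal_Year", "date"),
    ("Department", "varchar2(400)"),
    ("Project_Start", "date"),
    ("Project_End", "date"),
    ("ICD", "varchar2(400)"),
    ("IRG", "varchar2(4000)"),
    ("Amount", "number"),
    ("Keywords", "varchar2(255)"),
]


def create_table_sql(table_name="nih_awards"):
    """Build a CREATE TABLE statement from FIELDS (table name is hypothetical)."""
    columns = ",\n".join(f"  {name} {sql_type}" for name, sql_type in FIELDS)
    return f"CREATE TABLE {table_name} (\n{columns}\n)"


if __name__ == "__main__":
    print(create_table_sql())
```

Keeping the fields in one list means the same definition can drive both the table creation and any later per-field checks.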
Statistics:
Years covered: 1972-2004, total 1,028,521 records (detailed statistics)
Storage Space Required:
Number of records per year = 70,000. Estimated number of total
records in 2005 = 1,030,000.
Approximately 2.3 GB of raw data by 2005.
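As a rough check on these figures, the implied average record size can be computed. This is derived arithmetic, not a number reported by the source, and it assumes 1 GB = 10^9 bytes.

```python
TOTAL_RECORDS_2005 = 1_030_000  # estimated total records in 2005 (from the text)
RAW_DATA_GB = 2.3               # approximate raw data size by 2005 (from the text)
BYTES_PER_GB = 10**9            # ASSUMPTION: decimal gigabytes


def average_record_bytes(total_bytes, records):
    """Average size of one record in bytes."""
    return total_bytes / records


avg = average_record_bytes(RAW_DATA_GB * BYTES_PER_GB, TOTAL_RECORDS_2005)
print(f"about {avg / 1000:.1f} KB per record")
```

Under these assumptions the average comes out a little above 2 KB per record, which is plausible given that each record carries a free-text abstract.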
Data Quality
There are missing values for:
- PI_Email
- PI_Title
- Institution
- Department
Other known issues:
- Missing "pipe delimiters" make some rows show fewer than 14 columns
- There are "duplicates": records with the same information but a
different Grant_Number (e.g. data00.txt#44899, data00.txt#44900)
- Some "duplicates" are a lesser version of their counterpart
(e.g. data00.txt#44818, data00.txt#44819; data00.txt does not have
abstract information)
- Abstract: many records have the word "DESCRIPTION" in front.
Missing abstracts are identified as "This abstract is not available"
or "There is no text on file for this abstract".
- Nonexistent thesaurus terms are identified with "There are no
thesaurus terms on file for this project"
- Street address information is in TAB-delimited format
- Many records are missing the final ("IRG") field. In some records,
row headers such as "Project Start," "Project End," "ICD," and "IRG"
appear as delimiters instead of pipe delimiters.
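The "duplicates" described above can be flagged by grouping records on every field except the grant number and the abstract. This is a minimal sketch, assuming each record has already been parsed into a dict keyed by field name. The key names and the sample records are made up for illustration and do not come from the data files.

```python
from collections import defaultdict


def find_duplicate_groups(records, ignore=("Grant_Number", "Abstract")):
    """Return groups of records that agree on every field not listed in `ignore`."""
    groups = defaultdict(list)
    for rec in records:
        # Build a hashable key from all compared fields, in a stable order.
        key = tuple(sorted((k, v) for k, v in rec.items() if k not in ignore))
        groups[key].append(rec)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Illustrative records only (not taken from data00.txt).
    sample = [
        {"Grant_Number": "1", "Project_Title": "A", "Abstract": "some text"},
        {"Grant_Number": "2", "Project_Title": "A", "Abstract": ""},
        {"Grant_Number": "3", "Project_Title": "B", "Abstract": "other text"},
    ]
    for group in find_duplicate_groups(sample):
        print([rec["Grant_Number"] for rec in group])
```

Ignoring the abstract as well as the grant number lets the same pass catch the "lesser version" case, where one copy lacks abstract information. Deciding which copy to keep is left to a manual or follow-up step.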
Detailed statistics on missing data will be compiled when the data is
uploaded into Oracle.
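Before such statistics are compiled, the placeholder strings listed above need to be treated as missing values. The following is a minimal sketch of that normalization and a per-field count. It assumes records are dicts keyed by field name (the key names are illustrative) and that a leading "DESCRIPTION" may be followed by a colon or whitespace, which the source does not specify.

```python
import re
from collections import Counter

# Placeholder texts quoted in the data quality notes.
MISSING_ABSTRACT = (
    "This abstract is not available",
    "There is no text on file for this abstract",
)
MISSING_THESAURUS = ("There are no thesaurus terms on file for this project",)


def _is_placeholder(text, placeholders):
    """True if text equals one of the placeholders, ignoring case and a final period."""
    normalized = text.rstrip(". ").lower()
    return any(normalized == p.lower() for p in placeholders)


def clean_abstract(text):
    """Drop a leading "DESCRIPTION" and return None for empty or placeholder abstracts."""
    text = (text or "").strip()
    # ASSUMPTION: the prefix is upper case and may be followed by ":" or spaces.
    text = re.sub(r"^DESCRIPTION\b[\s:]*", "", text)
    if not text or _is_placeholder(text, MISSING_ABSTRACT):
        return None
    return text


def clean_thesaurus(text):
    """Return None for empty or placeholder thesaurus terms."""
    text = (text or "").strip()
    if not text or _is_placeholder(text, MISSING_THESAURUS):
        return None
    return text


def count_missing(records):
    """Count missing values per field across parsed records."""
    counts = Counter()
    for rec in records:
        if clean_abstract(rec.get("Abstract")) is None:
            counts["Abstract"] += 1
        if clean_thesaurus(rec.get("Thesaurus_Terms")) is None:
            counts["Thesaurus_Terms"] += 1
        for field in ("PI_Email", "PI_Title", "Institution", "Department"):
            if not (rec.get(field) or "").strip():
                counts[field] += 1
    return counts
```

Returning `None` for placeholders keeps "no abstract on file" distinct from a real abstract that happens to be short.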
Data Cleaning
Records which were missing the final ("IRG") field had a blank
field added to get the correct record length. Also, row header delimiters
(such as "ICD" and "IRG" as mentioned above) were
replaced by pipe delimiters (|) in as many cases as possible.
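The padding step can be sketched as follows. This is an illustration rather than the script that was actually used: the expected count of 14 fields is an assumption taken from the "14 columns" remark in the data quality notes, and the function name is hypothetical. Replacing row-header delimiters is not covered here, since the exact form of those headers in the raw files is not described.

```python
EXPECTED_FIELDS = 14  # ASSUMPTION: taken from the "14 columns" remark


def pad_missing_irg(line, expected=EXPECTED_FIELDS, delimiter="|"):
    """Append one blank field to a record that is exactly one field short."""
    fields = line.rstrip("\r\n").split(delimiter)
    if len(fields) == expected - 1:
        fields.append("")  # blank stand-in for the missing final (IRG) field
    return delimiter.join(fields)


if __name__ == "__main__":
    short_row = "|".join(["x"] * 13)  # illustrative record with 13 fields
    print(pad_missing_irg(short_row).count("|") + 1)  # field count after padding
```

Rows that are short by more than one field are returned unchanged, so they can be reviewed separately instead of being silently padded.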
Acknowledgements
This data set description was compiled by Jay Askren, Saiful Bahari,
Chris Friend, Katy Börner and Caroline Courtney.
Information Visualization CyberInfraStructure
@ SLIS, Indiana University
Last Modified June 04, 2004