InfoVis CyberInfrastructure

National Institutes of Health (NIH) Grant Award Data

Description

The National Institutes of Health (NIH), which comprises 27 institutes and centers, is an agency of the U.S. Department of Health and Human Services. It awards grants to support basic and clinical biomedical, behavioral, and bioengineering research.

Origins

CRISP (Computer Retrieval of Information on Scientific Projects) is a searchable database of federally funded biomedical research projects conducted at universities, hospitals, and other research institutions. The database, maintained by the Office of Extramural Research at the National Institutes of Health, includes projects funded by the National Institutes of Health (NIH), Substance Abuse and Mental Health Services Administration (SAMHSA), Health Resources and Services Administration (HRSA), Food and Drug Administration (FDA), Centers for Disease Control and Prevention (CDCP), Agency for Healthcare Research and Quality (AHRQ), and Office of Assistant Secretary of Health (OASH).

Information on NIH Award amounts is available at the Award Data web site.

Data Format

Raw Data:
Please query the NIH awards database via CRISP to become familiar with this data set.

Data Fields:

Grant Number                        number
PI First Name                       varchar2(1000)
PI Middle Name                      varchar2(1000)
PI Last Name                        varchar2(1000)
PI Email                            varchar2(1000)
PI Title                            varchar2(4000)
Project Title                       varchar2(2550)
Abstract                            clob
Thesaurus Terms                     varchar2(4000)
Institution Name                    varchar2(2000)
Institution Address                 varchar2(2000)
Institution City                    varchar2(2000)
Institution State                   varchar2(500)
Institution Zipcode1                number
Institution Zipcode2                number
Institution Country                 varchar2(500)
Fiscal Year                         date
Department                          varchar2(400)
Project Start                       date
Project End                         date
Institutes Centers Divisions (ICD)  varchar2(400)
Integrated Review Group (IRG)       varchar2(4000)
Amount                              number
Keywords                            varchar2(255)

data_is_ok char
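
The field list above can be carried into code as a simple lookup table, for example to check values against the declared column widths before loading. The following is a minimal sketch, not the project's loader: the names `FIELD_TYPES`, `max_length`, and `too_long` are hypothetical, only a subset of the fields is shown, and widths are compared against character counts, whereas Oracle may count varchar2 lengths in bytes depending on its configuration.

```python
import re
from typing import Dict, List, Optional

# Field name -> declared type, taken from the field list above.
# Only a subset is shown; the remaining fields follow the same pattern.
FIELD_TYPES: Dict[str, str] = {
    "Grant Number": "number",
    "PI Last Name": "varchar2(1000)",
    "PI Title": "varchar2(4000)",
    "Project Title": "varchar2(2550)",
    "Abstract": "clob",
    "Institution State": "varchar2(500)",
    "Department": "varchar2(400)",
    "Keywords": "varchar2(255)",
}


def max_length(declared_type: str) -> Optional[int]:
    """Return n for a 'varchar2(n)' type, or None for types without a width."""
    match = re.fullmatch(r"varchar2\((\d+)\)", declared_type)
    return int(match.group(1)) if match else None


def too_long(record: Dict[str, str]) -> List[str]:
    """List the fields in `record` whose text exceeds the declared width.

    Assumption: the width is compared against the character count.
    """
    offenders = []
    for name, value in record.items():
        limit = max_length(FIELD_TYPES.get(name, ""))
        if limit is not None and len(value) > limit:
            offenders.append(name)
    return offenders


# Hypothetical record, used only to exercise the check.
sample = {"Keywords": "x" * 300, "Department": "Biology", "Abstract": "y" * 10000}
print(too_long(sample))  # ['Keywords']
```

Fields declared as clob, number, or date have no width in this table, so the check skips them.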

Statistics:
Years covered: 1972-2004, total 1,028,521 records (detailed statistics)

Storage Space Required:
Number of records per year = 70,000. Estimated number of total records in 2005 = 1,030,000.
Approximately 2.3 GB of raw data by 2005.
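
The two estimates above imply an average record size of roughly 2.2 KB, which can be used for rough capacity planning. The sketch below only restates that arithmetic. It assumes 1 GB = 10**9 bytes and that storage grows linearly with the record count; neither assumption is stated in the source, and `projected_size_gb` is a hypothetical helper.

```python
# Figures quoted in this description.
total_records_2005 = 1_030_000   # estimated number of records in 2005
raw_size_gb = 2.3                # approximate raw data size by 2005
records_per_year = 70_000        # quoted number of records per year

# Assumption: 1 GB = 10**9 bytes (the source does not say which convention it uses).
bytes_per_record = raw_size_gb * 10**9 / total_records_2005


def projected_size_gb(n_records: int) -> float:
    """Scale the raw size linearly with the record count (an assumption)."""
    return n_records * bytes_per_record / 10**9


print(f"average record size: about {bytes_per_record:.0f} bytes")
print(f"one more year of records: about {projected_size_gb(records_per_year):.2f} GB")
```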

Data Quality

The following are missing from some records:

PI_Email
PI_Title
Institution
Department
Pipe delimiters, which leaves some rows with fewer than 14 columns

Other known problems:

Some "duplicates" carry the same information but have different Grant_Number values (e.g. data00.txt#44899, data00.txt#44900).
Some "duplicates" are a lesser version of their counterpart (e.g. data00.txt#44818, data00.txt#44819 - data00.txt does NOT have abstract information).
Abstract: many records have the word "DESCRIPTION" in front of the abstract text. Missing abstracts are identified as "This abstract is not available" OR "There is no text on file for this abstract".
Nonexistent thesaurus terms are identified with "There are no thesaurus terms on file for this project".
Street address information is in TAB-delimited format.
Many records were missing the final ("IRG") field. In some records, row headers such as "Project Start," "Project End," "ICD," and "IRG" appeared as delimiters in place of pipe characters.

Detailed statistics on missing data will be compiled when the data is uploaded into Oracle.
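
Several of the problems listed in this section can be detected mechanically. The following is a minimal sketch under stated assumptions, not the checks the authors ran: the function names are hypothetical, the expected row width of 14 is taken from the remark about short rows, and the handling of punctuation after "DESCRIPTION" is a guess, since the source does not show what follows that word.

```python
from typing import Optional

EXPECTED_COLUMNS = 14  # row width implied by the remark about short rows

# Placeholder strings quoted in this section.
MISSING_ABSTRACT = (
    "This abstract is not available",
    "There is no text on file for this abstract",
)
MISSING_THESAURUS = "There are no thesaurus terms on file for this project"


def is_short_row(line: str) -> bool:
    """True if a pipe-delimited line has fewer fields than expected."""
    return len(line.rstrip("\n").split("|")) < EXPECTED_COLUMNS


def clean_abstract(text: str) -> Optional[str]:
    """Return the abstract text, or None if it is empty or a placeholder."""
    text = text.strip()
    if not text or text.startswith(MISSING_ABSTRACT):
        return None
    if text.startswith("DESCRIPTION"):
        # Assumption: the prefix may be followed by punctuation such as ":".
        text = text[len("DESCRIPTION"):].lstrip(" :.-")
    return text or None


def clean_thesaurus(text: str) -> Optional[str]:
    """Return the thesaurus terms, or None if they are empty or a placeholder."""
    text = text.strip()
    if not text or text.startswith(MISSING_THESAURUS):
        return None
    return text
```

Returning None for placeholders keeps "no abstract on file" distinct from an abstract that happens to be short, which makes it easier to count missing values later.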

Data Cleaning

Records which were missing the final ("IRG") field had a blank field added to get the correct record length. Also, row header delimiters (such as "ICD" and "IRG" as mentioned above) were replaced by pipe delimiters (|) in as many cases as possible.
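
The two cleaning steps described above could be sketched as follows. This is an illustration under assumptions, not the script that was actually used: `repair_record` is a hypothetical name, the row width of 14 comes from the Data Quality section, and the stray row headers are assumed to appear as literal words surrounded by whitespace. Headers are only replaced in rows that are already too short, so that well-formed rows containing words such as "ICD" are left alone.

```python
import re

EXPECTED_COLUMNS = 14  # assumed row width, from the Data Quality section

# Assumed spelling of the stray row headers; an optional colon is tolerated.
HEADER_DELIMITERS = re.compile(r"\s*\b(?:Project Start|Project End|ICD|IRG)\b:?\s*")


def repair_record(line: str) -> str:
    """Return a pipe-delimited record repaired where possible."""
    line = line.rstrip("\n")
    fields = line.split("|")

    # Step 1: in short rows, turn row headers used as delimiters into pipes.
    if len(fields) < EXPECTED_COLUMNS:
        line = HEADER_DELIMITERS.sub("|", line)
        fields = line.split("|")

    # Step 2: a row that is exactly one field short is treated as missing
    # the final ("IRG") field and gets a blank field appended.
    if len(fields) == EXPECTED_COLUMNS - 1:
        fields.append("")

    return "|".join(fields)


# Hypothetical rows, used only to exercise the function.
broken = "a|b|c|d|e|f|g|h|i|j Project Start 2001 Project End 2003 ICD NCI IRG ZZ"
print(repair_record(broken))
```

Rows that are still shorter than 13 fields after step 1 are returned unchanged, which matches the "in as many cases as possible" caveat: they need manual inspection rather than automatic padding.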

Acknowledgements

This data set description was compiled by Jay Askren, Saiful Bahari, Chris Friend, Katy Börner and Caroline Courtney.

Information Visualization CyberInfraStructure @ SLIS, Indiana University
Last Modified June 04, 2004