Databases
> Patent Data
Description |
|
For over 200 years, the United States
Patent and Trademark Office (USPTO) has been processing and disseminating
patent and trademark applications and information to promote an understanding
of intellectual property protection and to facilitate the development
and sharing of new technologies worldwide. The office is a federal agency
in the Department of Commerce and employs over 6,500 full-time staff.
Origins |
|
Patent data prior to 1996 was generously made available by Steven A. Morris,
Electrical and Computer Engineering, Oklahoma State University. Patent
data from 1996 to present can be downloaded from ftp://ftp.uspto.gov/pub/patdata/.
Patent updates are released weekly, on Tuesdays.
Data Format |
|
Raw Data:
Please query the USPTO
databases and examine the US
patent classification hierarchy to get familiar with this data set.
The patent bibliographic raw data is available as one zipped file for
each weekly issue, beginning with week 36 of 1996. Within each zip file,
the data appears in "PTO Green Book" format as concatenated
81-character, fixed-length, linefeed-terminated ASCII records. Each file
is approximately 2 to 3 MB zipped, and unzips to a single 20 to 30 MB
ASCII file.
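As a minimal sketch of how one unzipped weekly file could be cut into its fixed-length records, the Python function below slices the data every 81 bytes. It assumes that the 81 characters include the terminating linefeed (80 data characters plus the linefeed); this is our reading of the description above, not a documented property of the format. The name `split_records` is hypothetical, and the sketch does not interpret the contents of a record.

```python
RECORD_LENGTH = 81  # assumption: 80 data characters plus one linefeed


def split_records(data: bytes) -> list:
    """Split the bytes of one unzipped weekly file into record strings.

    Raises ValueError if the data is not a whole number of records or
    if a record does not end with a linefeed.
    """
    if len(data) % RECORD_LENGTH != 0:
        raise ValueError("data length is not a multiple of the record length")
    records = []
    for start in range(0, len(data), RECORD_LENGTH):
        chunk = data[start:start + RECORD_LENGTH]
        if chunk[-1:] != b"\n":
            raise ValueError("record at offset %d is not linefeed-terminated" % start)
        # Drop the linefeed and decode the remaining 80 characters.
        records.append(chunk[:-1].decode("ascii"))
    return records
```

Checking the length and the terminator up front makes a truncated download or a shifted record boundary fail loudly instead of producing misaligned records.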
Data Fields:
type varchar2(4000)
ocl_thesaurus_class varchar2(4000)
ocl_thesaurus_subclass varchar2(4000)
data_is_ok char
xcl_thesaurus_class varchar2(4000)
xcl_thesaurus_subclass varchar2(4000)
doc_id number
name varchar2(2000)
address varchar2(2000)
city varchar2(2000)
state varchar2(500)
zipcode1 number
zipcode2 number
country varchar2(500)
last_name varchar2(1000)
middle_name varchar2(1000)
first_name varchar2(1000)
date_published date
title varchar2(2550)
full_text clob
type varchar2(100)
abstract clob
author_seq number
Statistics:
There are a total of 5,402,657 author entries (non-unique). Of these, 1,757,094
*seem to be* unique, but many of them differ from another entry only by a
missing middle initial. Hence, there should be no more than 1,200,000 unique authors.
There are a total of 22,650,056 citation links (for the 2,582,647 records
from 1976 to Feb 2003).
Please read the detailed statistics
to learn more about the coverage and completeness of data.
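To illustrate how the 'middle initial missing' cases affect the count of unique authors, the Python sketch below counts distinct names twice: once using the full name, and once ignoring the middle name. This is an illustration only and not necessarily how the figures above were obtained. The function name `count_unique_authors` is hypothetical; the tuple order (last_name, middle_name, first_name) follows the field list above.

```python
def count_unique_authors(authors):
    """Return (strict_count, collapsed_count) for a list of name tuples.

    Each tuple is (last_name, middle_name, first_name). The strict count
    treats any difference in the middle name as a different author; the
    collapsed count ignores the middle name entirely.
    """
    def norm(text):
        # Normalize case and surrounding whitespace; None counts as empty.
        return (text or "").strip().lower()

    strict = set()
    collapsed = set()
    for last, middle, first in authors:
        strict.add((norm(last), norm(middle), norm(first)))
        collapsed.add((norm(last), norm(first)))
    return len(strict), len(collapsed)
```

Ignoring the middle name also merges people who genuinely differ only in their middle name, so the collapsed count can undercount; the true number of unique authors should lie between the two counts.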
Storage Space Required:
Number of Entries: 2,582,647. For the years 1963-2005 we estimate
a total of 2,640,000 and a size of 350 MB.
(We currently have 2,582,647 patent records for the years 1976 to Feb
2003. The years 2003-2005 should account for another 55,000 patent records.)
See also Kevin Boyack's yearly statistics.
Data Quality |
|
- All patents have titles and patent IDs; all but 3 have a date of issue.
So the absolute essentials are in place.
- Around 2.3% do not have citation information. It is possible that
these patents really did not cite any other patents, although they could have
cited non-patent publications. The number is small enough
to allow for that.
- Around 16.2% do not have information about the Assignee group. This
could mean one of two things: the inventor data is missing, or the inventor
is not affiliated with any organization.
- Very few (0.0005%) do not have inventors, which implies that most
records with missing Assignee groups above have inventors not affiliated
with any organization.
- 3.35% do not have OCL information; 8.66% do not have XCL information.
These are not disturbingly huge numbers, but it's still hard to imagine
a patent not being classified into any category at all. Interestingly
enough, there are quite a few records that do not have OCL but do have
XCL information.
- 99.6% of the current dataset is patents of type 'Utility'.
These issues will be documented further when the data is uploaded into
the Oracle database.
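Percentages like the ones above can be recomputed with a simple completeness check. The Python sketch below is a minimal example, assuming each patent record has been loaded into a dictionary keyed by field name; that layout and the function name `percent_missing` are assumptions made for illustration.

```python
def percent_missing(records, field):
    """Return the percentage of records whose `field` is absent or empty."""
    if not records:
        return 0.0
    missing = 0
    for record in records:
        value = record.get(field)
        # Treat a missing key, None, and a blank string as "no information".
        if value is None or str(value).strip() == "":
            missing += 1
    return 100.0 * missing / len(records)
```

For example, `percent_missing(records, "ocl_thesaurus_class")` would give the share of records without OCL information, provided the records use the field names listed under Data Fields.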
Data Cleaning |
|
The Patent Number field of our
patent dataset has extra characters that are not part of the patent numbers
issued by the USPTO. They vary by patent type and
are as follows:
- Utility patents have an extra character at the beginning and one at
the end.
- SIRs (Statutory Invention Registrations) have four 0’s after the letter “H” and an
extra character at the end.
- Design patents have an extra character after letter “D”
and one at the end.
- Reissue patents have an extra character after “RE” and
one at the end.
- Defensive Publications have an extra character after “T”
and one at the end.
- Plant Patents have two 0’s after “PP” and an extra
character at the end.
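The rules above can be sketched as a small Python function that strips the extra characters. It infers the patent type from the prefix of the stored number, which assumes that the extra leading character of a utility patent is never "H", "D", "T", "R" or "P"; that assumption, the function name `clean_patent_number`, and the sample values in the usage note are ours and are not taken from the dataset.

```python
def clean_patent_number(raw):
    """Strip the extra characters from a stored patent number.

    The type is inferred from the prefix (an assumption, see above):
    RE = reissue, PP = plant, H = SIR, D = design, T = defensive
    publication, anything else = utility.
    """
    raw = raw.strip()
    if raw.startswith("RE"):
        return "RE" + raw[3:-1]   # one extra character after "RE", one at the end
    if raw.startswith("PP"):
        return "PP" + raw[4:-1]   # two 0's after "PP", one extra at the end
    if raw.startswith("H"):
        return "H" + raw[5:-1]    # four 0's after "H", one extra at the end
    if raw.startswith("D"):
        return "D" + raw[2:-1]    # one extra character after "D", one at the end
    if raw.startswith("T"):
        return "T" + raw[2:-1]    # one extra character after "T", one at the end
    return raw[1:-1]              # utility: one extra at each end
```

With made-up inputs, `clean_patent_number("H00001234X")` returns `"H1234"` and `clean_patent_number("045678901")` returns `"4567890"`. If the patent type is stored alongside the number, passing it in explicitly would be safer than inferring it from the prefix.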
Acknowledgements |
|
We are grateful to Steven A. Morris, Electrical and Computer Engineering,
Oklahoma State University for making a larger patent data set available
to us and for his guidance in parsing and analyzing the data.
This data set description was compiled by Ruchi Kapoor,
Katy Börner,
and Caroline Courtney.
Information Visualization CyberInfraStructure
@ SLIS, Indiana University
Last Modified June 04, 2004 |