Import Bulk PatentsView Data with R

Importing USPTO PatentsView Data Download Files with R.

Paul Oldham https://github.com/wipo-analytics
2022-01-13

Introduction

In the last article we explored how to download USPTO PatentsView patent data files. In the process we used web scraping with the rvest package to help us identify the files to download and to keep a record of them that we could store with the data.

In this article we are going to focus on importing the data files into R. While we will use R, a very similar logic will apply if writing this code in Python.

The USPTO PatentsView data files are a set of zip files that take up around 100 GB for the granted patents (grants) and a smaller 26 GB for applications (called pregrant). In addition to the main tables there are separate yearly download tables for the main text segments, consisting of the brief summary, the description and the claims. If you have been following along then the entire grant directory should look something like Figure 1.
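Before importing anything it can help to take stock of what was actually downloaded. The following is a minimal sketch in base R that lists the zip files in a folder and reports their sizes. The helper name `summarise_zip_files()` and the demo folder are hypothetical, and the two placeholder files are created only so the example runs on its own; in practice you would point the function at your own grants folder.

```r
# Summarise the zip files in a download folder: one row per file plus its size.
# `summarise_zip_files()` is a hypothetical helper, not part of any package.
summarise_zip_files <- function(dir) {
  paths <- list.files(dir, pattern = "\\.zip$", full.names = TRUE,
                      recursive = TRUE)
  data.frame(
    file = basename(paths),
    folder = basename(dirname(paths)),
    size_gb = file.size(paths) / 1e9,  # decimal gigabytes
    stringsAsFactors = FALSE
  )
}

# Demonstration on a throwaway folder holding two small placeholder files.
# These are not real zip archives; only their names and sizes matter here.
demo_dir <- file.path(tempdir(), "grant_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeBin(raw(1024), file.path(demo_dir, "table_a.tsv.zip"))
writeBin(raw(2048), file.path(demo_dir, "table_b.tsv.zip"))

zip_summary <- summarise_zip_files(demo_dir)
print(zip_summary)
cat("Total size (GB):", sum(zip_summary$size_gb), "\n")
```

Running the same function over the real download folder gives a quick check that the compressed sizes are in the range described above before any unzipping starts.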

Figure 1: Directory Structure for Downloaded Grants Data

If you want to work with the entire US patent collection, including the description, brief summary and claims, it is probably best to anticipate around 500 GB of disk space for the full set of files when unzipped.

The question now is how to import this data.

Importing the Bulk Data Files

We have a number of choices when planning to import this data. The best choice for your work will partly depend on what you want to do with the data afterwards and, in particular, on how much of this data you plan to use. In reality there are three main scenarios:

We will address scenario 1 in this article and then move to the others in the next articles.

Importing

In this article we focus on importing the bulk data using R, that is, the USPTO patent data files that we downloaded in the previous post. We will mainly address how to import some of the tables into R.
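As a rough illustration of what importing a single table can look like, here is a minimal sketch using base R only. It assumes that an unzipped table is a tab-separated text file with a header row; that assumption, the helper name `import_tsv()`, and the column names and values in the demo file are all made up for the example and should be checked against the actual files.

```r
# Write a tiny tab-separated file to stand in for an unzipped table.
# The column names and values below are invented for illustration only.
demo_file <- tempfile(fileext = ".tsv")
writeLines(
  c("id\ttitle\tyear",
    "0000001\tExample title one\t2001",
    "0000002\tExample title two\t2002"),
  demo_file
)

# `import_tsv()` is a hypothetical helper wrapping base R's read.delim().
# Reading every column as character avoids silent conversions, such as
# identifiers losing leading zeros; convert columns afterwards as needed.
import_tsv <- function(path, n_max = -1) {
  utils::read.delim(
    path,
    sep = "\t",
    quote = "\"",
    colClasses = "character",
    nrows = n_max,              # -1 reads all rows
    stringsAsFactors = FALSE
  )
}

# Read only the first row to inspect the structure cheaply,
# then read the whole file.
preview <- import_tsv(demo_file, n_max = 1)
full <- import_tsv(demo_file)
str(full)
```

Previewing a handful of rows first is a cheap way to check column names and types before committing to a full read. For the larger tables, packages such as readr or data.table are likely to be considerably faster than base R, but the overall logic stays the same.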


If we downloaded some or all of the

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/wipo-analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Oldham (2022, Jan. 13). WIPO Patent Analytics: Import Bulk PatentsView Data with R. Retrieved from https://wipo-analytics.github.io/posts/2022-01-13-patentsview-import-bulk-data/

BibTeX citation

@misc{oldham2022import,
  author = {Oldham, Paul},
  title = {WIPO Patent Analytics: Import Bulk PatentsView Data with R},
  url = {https://wipo-analytics.github.io/posts/2022-01-13-patentsview-import-bulk-data/},
  year = {2022}
}