A Patent Dataset for Drone Technology

The drones package provides access to patent datasets containing references to drones in the text. The datasets are intended to be used for training in patent analytics by providing access to raw and cleaned data in one place.

The main drones dataset consists of 15,570 patent applications that refer to the word drone or drones somewhere in the text. The dataset is based on a search of patent documents from the main patent jurisdictions for the period 1845 to 2017 using the Clarivate Analytics Derwent Innovation database. Additional supplementary datasets may be provided in future updates.

The data within the drones package is intended exclusively for training in patent analytics, as described on the WIPO analytics home page. Note that the dataset is deliberately noisy so that it can be used to practise methods for cleaning data. It is a training dataset: it is not expected to be complete and is not intended to support definitive statements about patent activity for drone technology. The drones patent dataset is part of work in progress on the WIPO Patent Analytics Handbook.

The drones dataset is contained in an R package but you can use the drones dataset with or without R.

Download for use outside RStudio

If you want to work outside RStudio you can download the core drones reference file as a zip file containing the data in .csv format here. Other datasets within the R package are simply subsets of the core datasets. See the online reference page for details.

Install in RStudio

If you are using RStudio then you can install the package from GitHub as follows. If you need to install RStudio, follow the instructions to install R here and RStudio here.

Make sure you have devtools installed. It is also a very good idea to install the tidyverse if you don’t have it already.

install.packages("devtools")
install.packages("tidyverse")
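If you would rather not reinstall packages you already have, a minimal sketch along the following lines checks what is missing first. The helper name missing_packages is hypothetical and is not part of the drones package; it relies only on base R's requireNamespace.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("devtools", "tidyverse"))

# Install only what is missing (uncomment to run).
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that the check can be run without changing your library.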

Next install the drones package.

devtools::install_github("wipo-analytics/drones")

library(drones)

What is in the datasets

The datasets are fully documented inside the R package.

The core drones dataset is a table with 15,570 observations of 22 variables:

abstract: The original document abstract, a character vector. 12,452 (94%)
abstract_english: The English document abstract, a character vector. 12,027 (95%)
application_number: The long application number including the date, a character vector. 15,776 (100%)
basic_patent_date: Derwent Innovation basic patent date, a character vector. 2,325 (93%)
basic_patent_number: The Derwent Innovation basic patent number forming the base for the dwpi_family, a character vector. 10,281 (93%)
applicant: The original applicant or assignee name, a character vector. 7,488 (90%)
applicant_cleaned: A cleaned version of the applicant name, a character vector. 6,744 (90%)
cited_nonpatent: Literature citations, field is noisy, a character vector. 28,361 (30%)
cited_patents: Patents cited in one or more documents, a character vector. 93,703 (71%)
citing_patents: Patents citing one or more documents, a character vector. 64,328 (39%)
cpc: The Cooperative Patent Classification (CPC) codes, a character vector. 17,077 (92%)
dwpi_family_dates: Family dates for DWPI family numbers (Derwent World Patent Index), a character vector. 5,153 (93%)
dwpi_family_kind: Document kind codes for DWPI family members, a character vector. 42 (93%)
dwpi_family_numbers: DWPI family members, a character vector. 30,984 (93%)
first_claim: The first claim in a patent document, a character vector. 13,332 (97%)
inpadoc_family_members: INPADOC family members in long format with dates, a character vector. 49,625 (98%)
inpadoc_first_family_member: The earliest publication number in the inpadoc_family_members based on the date, a character vector. 9,020 (98%)
inventor: The original inventor name, a character vector. 19,293 (94%)
ipc: International Patent Classification (IPC) codes, a character vector. 8,489 (98%)
priority_number: Patent priority numbers in long form with dates, a character vector. 23,379 (99%)
publication_number: Publication numbers in short form minus dates, a character vector. 15,776 (100%)
publication_year: The year of publication of the publication numbers, a character vector. 145 (99%)
related_application_numbers: Details of related patent applications, a character vector. 7,124 (35%)
title_english: The English title, a character vector. 10,815 (99%)
title_original: The original title, normally concatenated as English, French, German etc., a character vector. 13,753 (97%)

Note that the coverage of a field is typically less than 100 percent of the documents. The numbers above are reference counts for cross-checking your own counts of the data. Your counts will typically differ from them for two reasons: failing to trim leading and trailing white space after separating concatenated fields on the semicolon, and NAs (Not Available) that appear in the data in R wherever a field has less than 100% coverage. Where your counts differ widely from the reference numbers you should investigate why.
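The effect of untrimmed white space and NAs on a count can be seen in a small sketch. The toy table and its values below are invented for illustration and are not taken from the drones data. The sketch also assumes that a reference count is a count of distinct values in a field, which is not stated explicitly above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy stand-in for a concatenated field (hypothetical values).
toy <- tibble::tibble(
  publication_number = c("A1", "B1", "C1", "D1"),
  inventor = c("Smith, J; Jones, A", "Jones, A", NA, "Smith, J ;Lee, K")
)

# Coverage: share of documents with a non-missing value in the field.
coverage <- mean(!is.na(toy$inventor))

# Distinct values when the white space is left in place.
untrimmed <- toy %>%
  separate_rows(inventor, sep = ";") %>%
  drop_na(inventor) %>%
  distinct(inventor) %>%
  nrow()

# Distinct values after trimming leading and trailing white space.
trimmed <- toy %>%
  separate_rows(inventor, sep = ";") %>%
  mutate(inventor = str_trim(inventor, side = "both")) %>%
  drop_na(inventor) %>%
  distinct(inventor) %>%
  nrow()
```

Here the untrimmed count is inflated because " Jones, A" and "Jones, A" are treated as different values, which is the kind of variance worth investigating.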

R users may find non-ASCII characters in the main text fields. If you run into problems, please raise an issue and provide details of the problem.

How to separate concatenated columns

Patent data is not tidy. Many of the columns contain data that is concatenated (joined) with a semicolon as a delimiter. In the case of the Lens patent data the delimiter is a double semicolon.

This means that in order to access the data for counting you will need to separate the values onto individual rows and trim the surrounding white space. You can do this easily with the tidyverse packages.

install.packages("tidyverse")

Load the library

library(tidyverse)

To demonstrate this let’s separate out the applicant (assignee) field and then count it up:

applicants <- drones::applicants %>% 
  separate_rows(applicant_cleaned, sep = ";") %>%
  mutate(applicant_cleaned = str_trim(applicant_cleaned, side = "both")) %>% 
  drop_na(applicant_cleaned) # drop not available entries

applicants %>% count(applicant_cleaned, sort = TRUE)

## # A tibble: 6,929 x 2
##    applicant_cleaned                               n
##    <chr>                                       <int>
##  1 QUALCOMM Incorporated                         498
##  2 Thales                                        382
##  3 HON HAI PRECISION INDUSTRY CO LTD             345
##  4 QINGHUA UNIV                                  343
##  5 Samsung Electronics Co. Ltd.                  213
##  6 International Business Machines Corporation   193
##  7 THE BOEING COMPANY                            181
##  8 GOOGLE INC.                                   167
##  9 Elwha LLC                                     166
## 10 SONY CORPORATION                              148
## # ... with 6,919 more rows
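The same pattern should carry over to data exported from the Lens, where the delimiter is a double semicolon. The sketch below uses an invented two-row table; the column name applicants and the values are assumptions for illustration, not the actual Lens export format.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy stand-in for a Lens-style field joined with ";;" (hypothetical values).
lens_toy <- tibble::tibble(
  id = c(1, 2),
  applicants = c("ACME CORP;; Globex Inc", "Globex Inc")
)

lens_counts <- lens_toy %>%
  separate_rows(applicants, sep = ";;") %>%      # split on the double semicolon
  mutate(applicants = str_trim(applicants, side = "both")) %>%
  drop_na(applicants) %>%
  count(applicants, sort = TRUE)
```

Splitting this kind of field on a single semicolon instead would leave empty strings between the two semicolons, so it is worth checking the delimiter before counting.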

If you want to run the same operation as a reusable function you could use the following, which takes the column name as a string, separates and trims the values, and returns a sorted count. Note that this has been adapted for tidy evaluation. In some fields it will be common for NA (not available) to appear prominently. Consider adding tidyr::drop_na to address these cases.

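separate_rows_trim <- function(df, col, sep) {
  # col is the column name as a string, e.g. "applicant_cleaned"
  df %>%
    tidyr::separate_rows(dplyr::all_of(col), sep = sep) %>% # one value per row
    dplyr::mutate(!!col := stringr::str_trim(.data[[col]], side = "both")) %>%
    dplyr::count(!!col := .data[[col]], sort = TRUE) # count each distinct value
}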
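For fields where NA appears prominently, one possible variant drops the missing values before counting. This is a sketch only: the function name separate_rows_trim_na and the toy data are hypothetical and are not part of the drones package.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical variant: separate, trim, drop NAs, then count.
# `col` is the column name as a string.
separate_rows_trim_na <- function(df, col, sep) {
  df %>%
    tidyr::separate_rows(dplyr::all_of(col), sep = sep) %>%
    dplyr::mutate(!!col := stringr::str_trim(.data[[col]], side = "both")) %>%
    tidyr::drop_na(dplyr::all_of(col)) %>%
    dplyr::count(!!col := .data[[col]], sort = TRUE)
}

# Toy data (invented values) to show the call.
toy <- tibble::tibble(
  applicant_cleaned = c("Thales; Elwha LLC", "Thales", NA, " Elwha LLC ;Thales")
)

toy_counts <- separate_rows_trim_na(toy, "applicant_cleaned", sep = ";")
```

Dropping NAs inside the function keeps the counts focused on real values, but it also hides how many documents lack the field, so you may prefer to check coverage separately first.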

We’ll add more to the package by way of guides and examples as we develop it. For the moment this helps to get you started.

Attribution

The lovely looking drone icon for the package was made by Nikita Golubev from www.flaticon.com
