The drones package provides access to patent datasets containing references to drones in the text. The datasets are intended to be used for training in patent analytics by providing access to raw and cleaned data in one place.
The main drones dataset consists of 15,570 patent applications that contain the word drone or drones somewhere in the text. The dataset is based on a search of patent documents from the main patent jurisdictions for the period 1845 to 2017 using the Clarivate Analytics Derwent Innovation database. Additional supplementary datasets may be provided in future updates.
The data within the drones package is intended exclusively for use in training in patent analytics as described on the WIPO analytics home page. Note that the dataset is deliberately noisy so that it can be used to practise methods for cleaning data. It is a training dataset: it is not expected to be complete and is not intended to support definitive statements about patent activity for drone technology. The drones patent dataset is part of work in progress on the WIPO Patent Analytics Handbook.
The drones dataset is distributed as an R package, but you can use it with or without R.
If you want to work outside RStudio you can download the core drones reference file as a zip file containing the data in .csv format here. Other datasets within the R package are simply subsets of the core datasets. See the online reference page for details.
If you are using RStudio then you can install the package from GitHub as follows. If you need to install RStudio, follow the instructions to install R here and RStudio here.
Make sure you have devtools installed. It is also a very good idea to install the tidyverse if you don’t have it already.
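A minimal sketch of this setup step, assuming you install from CRAN with `install.packages()`. The variable names `needed` and `to_install` are illustrative only, and the check simply skips packages that are already present:

```r
# Packages mentioned above: devtools (to install from GitHub) and the tidyverse
needed <- c("devtools", "tidyverse")

# requireNamespace() returns FALSE for packages that are not installed
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!installed]

# Install only what is missing
if (length(to_install) > 0) {
  install.packages(to_install)
}
```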
Next install the drones package.
The datasets are fully documented inside the R package.
The core drones dataset is a table with 15,570 observations of 22 variables:
abstract
The original document abstract, a character vector. 12,452 (94%)
abstract_english
The English document abstract, a character vector. 12,027 (95%)
application_number
The long application number including the date, a character vector. 15,776 (100%)
basic_patent_date
Derwent Innovation basic patent date, a character vector. 2325 (93%)
basic_patent_number
The Derwent Innovation basic patent number forming the base for the dwpi_family, a character vector. 10,281 (93%)
applicant
The original applicant or assignee name, a character vector. 7488 (90%)
applicant_cleaned
A cleaned version of the applicant name, a character vector. 6744 (90%)
cited_nonpatent
Literature citations (this field is noisy), a character vector. 28361 (30%)
cited_patents
Patents cited in one or more documents, a character vector. 93703 (71%)
citing_patents
Patents citing one or more documents, a character vector. 64328 (39%)
cpc
The Cooperative Patent Classification (CPC) codes, a character vector. 17077 (92%)
dwpi_family_dates
Family dates for DWPI family numbers (Derwent World Patent Index), a character vector. 5153 (93%)
dwpi_family_kind
Document kind codes for DWPI Family members, a character vector. 42 (93%)
dwpi_family_numbers
DWPI family members, a character vector. 30984 (93%)
first_claim
The first claim in a patent document, a character vector. 13332 (97%)
inpadoc_family_members
INPADOC Family Members in long format with dates, a character vector. 49,625 (98%)
inpadoc_first_family_member
The earliest publication number in the inpadoc_family_members based on the date, a character vector. 9020 (98%)
inventor
The original inventor name, a character vector. 19293 (94%)
ipc
International Patent Classification (IPC) codes, a character vector. 8489 (98%)
priority_number
Patent priority numbers in long form with dates, a character vector. 23379 (99%)
publication_number
Publication numbers in short form minus dates, a character vector. 15776 (100%)
publication_year
The year of publication of the publication numbers, a character vector. 145 (99%)
related_application_numbers
Details of related patent applications, a character vector. 7124 (35%)
title_english
The English title, a character vector. 10815 (99%)
title_original
The original title, normally concatenated as English, French, German etc., a character vector. 13,753 (97%)
Note that the coverage of each field typically does not add up to 100 percent of the documents. The numbers provided above are intended as reference counts for cross-checking when developing counts of the data. Typical reasons for variance from these counts are failing to trim leading and trailing whitespace when separating concatenated fields on the semicolon, and NA (Not Available) values that appear in the data in R where a field has less than 100% coverage. Where your counts diverge widely from the reference numbers you should investigate why.
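As a rough illustration of why trimming matters when cross-checking counts, the sketch below uses a small made-up table. The `toy` data and its values are hypothetical and are not taken from the drones data. It computes the coverage of a field and compares distinct counts before and after trimming:

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data: two documents with applicants, one with none (NA)
toy <- tibble(applicant = c("Thales; Qualcomm", "Qualcomm;Thales", NA))

# Coverage: share of documents where the field is present
coverage <- mean(!is.na(toy$applicant))

# Separate on the semicolon without trimming; " Qualcomm" keeps its leading space
untrimmed <- toy %>%
  separate_rows(applicant, sep = ";") %>%
  drop_na(applicant)

# Trim leading and trailing whitespace so equal names match
trimmed <- untrimmed %>%
  mutate(applicant = str_trim(applicant, side = "both"))

n_untrimmed <- n_distinct(untrimmed$applicant)  # " Qualcomm" and "Qualcomm" count separately
n_trimmed <- n_distinct(trimmed$applicant)
```

In this toy case the untrimmed data yields one more distinct name than the trimmed data, which is the kind of discrepancy the reference counts are meant to help you spot.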
For R users it is possible that foreign characters are present in the main text fields. In the event you run into problems please raise an issue and provide details of the problem.
Patent data is not tidy. Many of the columns contain data that is concatenated (joined) with a semicolon as a delimiter. In the case of the Lens patent data the delimiter is a double semicolon.
This means that in order to count the data you will need to separate the concatenated values onto their own rows and trim the surrounding whitespace. You can do this easily with the tidyverse packages.
Load the library
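A minimal sketch of this step, assuming the tidyverse is already installed. Attaching it makes `%>%`, `separate_rows()`, `mutate()`, `str_trim()`, `drop_na()` and `count()` available for the example that follows:

```r
# Attaches dplyr, tidyr, stringr and the other core tidyverse packages
library(tidyverse)
```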
To demonstrate this let’s separate out the applicant (assignee) field and then count it up:
applicants <- drones::applicants %>%
  separate_rows(applicant_cleaned, sep = ";") %>%
  mutate(applicant_cleaned = str_trim(applicant_cleaned, side = "both")) %>%
  drop_na(applicant_cleaned) # drop not available entries

applicants %>% count(applicant_cleaned, sort = TRUE)
## # A tibble: 6,929 x 2
##    applicant_cleaned                               n
##    <chr>                                       <int>
##  1 QUALCOMM Incorporated                         498
##  2 Thales                                        382
##  3 HON HAI PRECISION INDUSTRY CO LTD             345
##  4 QINGHUA UNIV                                  343
##  5 Samsung Electronics Co. Ltd.                  213
##  6 International Business Machines Corporation   193
##  7 THE BOEING COMPANY                            181
##  8 GOOGLE INC.                                   167
##  9 Elwha LLC                                     166
## 10 SONY CORPORATION                              148
## # ... with 6,919 more rows
If you wanted to run the same operation as a reusable function you could use the function below. Note that it has been adapted for tidy evaluation, so the column name is passed as a string. In some fields it will be common for NA (not available) to appear prominently. Consider adding tidyr::drop_na to address these cases.
# col: column name as a string; sep: delimiter, e.g. ";"
separate_rows_trim <- function(df, col, sep) {
  df %>%
    tidyr::separate_rows(dplyr::all_of(col), sep = sep) %>%
    dplyr::mutate(!!col := stringr::str_trim(.data[[col]], side = "both")) %>%
    dplyr::count(!!col := .data[[col]], sort = TRUE)
}
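As a usage sketch, the block below repeats a variant of the helper with `tidyr::drop_na()` added, as suggested above, and applies it to a small made-up table that uses the double semicolon delimiter described for Lens data. The `toy` data and the applicant names are hypothetical:

```r
library(magrittr)  # provides %>%

# Variant of the helper with NA entries dropped before counting
separate_rows_trim <- function(df, col, sep) {
  df %>%
    tidyr::separate_rows(dplyr::all_of(col), sep = sep) %>%
    dplyr::mutate(!!col := stringr::str_trim(.data[[col]], side = "both")) %>%
    tidyr::drop_na(dplyr::all_of(col)) %>%
    dplyr::count(!!col := .data[[col]], sort = TRUE)
}

# Hypothetical toy data with a double semicolon delimiter and one NA
toy <- tibble::tibble(
  applicant = c("Acme Ltd;; Example Corp", "Example Corp", NA)
)

counts <- separate_rows_trim(toy, "applicant", sep = ";;")
counts
```

Passing the column name as a string keeps the helper easy to call in a loop over several concatenated fields, at the cost of not accepting bare column names.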
We’ll add more to the package by way of guides and examples as we develop it. For the moment this should help you get started.
The lovely looking drone icon for the package was made by Nikita Golubev from www.flaticon.com