Chapter 6 Datasets

In this chapter we introduce the patent datasets developed for the Open Source Patent Analytics Project as training sets for patent analytics. The datasets are used in the walkthroughs. They will grow over time; here we briefly introduce the current set and explain how to access it.

The datasets are housed at the project GitHub repository. To download an individual file, click on its link and then select Raw.
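For readers who prefer to fetch files from a script rather than the browser, the following is a minimal sketch of how a GitHub file page address maps to its raw download address. It relies on GitHub's usual `raw.githubusercontent.com` URL pattern. The owner, repository, branch, path and file extension in the example are hypothetical placeholders, not the project's real location, and the function name `to_raw_url` is our own.

```python
from urllib.parse import urlsplit


def to_raw_url(blob_url: str) -> str:
    """Convert a GitHub file page URL into its raw-content URL.

    Expects the form https://github.com/<owner>/<repo>/blob/<branch>/<path>.
    Assumes the branch name contains no slash.
    """
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub file page URL: {blob_url}")
    # Drop the "blob" segment and switch to the raw-content host.
    owner, repo, _, branch, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, branch, *path])


# Hypothetical address for illustration only; substitute the real file page.
example = "https://github.com/example-owner/example-repo/blob/master/data/pizza_small.csv"
print(to_raw_url(example))
```

The resulting address can be passed to any tool that reads from a URL, which gives the same result as selecting Raw in the browser.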

6.1 The datasets

The datasets are intended to illustrate the range of possibilities for patent data including some of the challenges that may be encountered in cleaning and analysing patent data. They are also drawn from different sources.

6.1.1 Core dataset

The pizza_medium_clean dataset is a precleaned version of the pizza_medium dataset. Specifically, the applicants and inventors fields have already been cleaned, corrupted characters have been repaired and other common cleaning tasks have been carried out. This makes it easier to work with the data.

This dataset is used in the chapters on Tableau Public and Gephi. You can download the files for those chapters through the datasets homepage or directly as a zip file from this link.

6.1.2 Sample datasets

Almost everyone likes pizza, and it is easy to search a patent database for the term “pizza”. It is also an area of patent activity that encompasses a wide range of technologies, such as pizza ovens, pizza boxes, pizza cutters and pizza toppings. It is therefore useful for demonstrating ways of interrogating patent data for particular topics.

pizza_small is a very small 26-row dataset created by downloading the first page of results from the European Patent Office Espacenet database for a smart search on “pizza”. It is a quick and easy test dataset.
pizza_medium was created from a sample of data from a search of the WIPO Patentscope database for the term “pizza” and contains 9,996 rows of data. It is intended to illustrate the data format from Patentscope and to allow work on a medium-sized dataset. Note that the format differs from the Espacenet format and presents different challenges. An important feature of Patentscope data from a statistical standpoint is that the field marked publication_number in the original data lacks the two-character kind code (such as A1 or B1) and is therefore an application_number.
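The kind code point can be checked programmatically. The sketch below tests whether a number ends in a kind code. It assumes a simplified pattern (two-letter country code, digits, then one letter with an optional digit), which will not cover every national numbering scheme. The sample numbers are invented for illustration and are not taken from the datasets.

```python
import re

# Assumed, simplified pattern: country code, digits, then a kind code
# made of one letter and an optional digit (for example A1 or B2).
KIND_CODE_PATTERN = re.compile(r"[A-Z]{2}\d+[A-Z]\d?")


def has_kind_code(number: str) -> bool:
    """Return True if the number appears to end in a kind code."""
    # Remove separators such as "/" or spaces before matching.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", number).upper()
    return KIND_CODE_PATTERN.fullmatch(cleaned) is not None


# Invented sample numbers, for illustration only.
samples = ["EP1234567A1", "WO/2014/012345", "US20130012345"]
for sample in samples:
    print(sample, has_kind_code(sample))
```

Running a check like this over the publication_number field is one way to see how many entries carry no kind code before using the field in counts.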

6.1.3 Patent Landscape Reports datasets

Three datasets are drawn from the WIPO Patent Landscape Reports. The datasets address different topics, present a variety of fields and formats and are of different sizes. Each dataset is linked to a detailed patent landscape report that provides an insight into approaches to patent analytics. The final two entries in the list point to data published with the WIPO Technology Trends reports.

ewaste presents the results of research for a report on patent activity for electronic waste recycling and its implications for developing countries.
solar_cooking presents the data supporting a report on technologies that use solar energy as the source for cooking and pasteurizing food.
ritonavir presents the data for a report on patent activity for the HIV antiretroviral drug Ritonavir in the field of pharmaceuticals. The dataset illustrates specific activity around issues such as dosage and also the problem of ‘evergreening’ in patent activity.
Artificial Intelligence. The WIPO Technology Trends 2019 report focused on artificial intelligence. The raw data used for the report is not available for direct download, but the data for individual figures is available in a set of multi-sheet Excel tables. In addition, WIPO offers an Artificial Intelligence Index page where the search terms used to generate the report data link to the corresponding searches in WIPO Patentscope. For example, the Machine Learning link will generate a new dataset with results. Users who have signed up for a free Patentscope account can then refine the search and download basic data fields for up to 10,000 records.
Assistive Technology. The WIPO Technology Trends 2021 report focused on assistive technologies. The report includes two datasets in Excel.