Chapter 4 Datasets

In this chapter we introduce the patent datasets developed for the Open Source Patent Analytics Project as training sets for patent analytics. The datasets are used in the walkthroughs. The collection will grow over time, so here we briefly introduce the current datasets and explain how to access them.

The datasets are housed at the project GitHub repository. To download an individual file, click on its link and then select raw.

4.1 The datasets

The datasets are intended to illustrate the range of possibilities for patent data including some of the challenges that may be encountered in cleaning and analysing patent data. They are also drawn from different sources.

4.1.1 Pizza patent datasets

Almost everyone likes pizza and it is easy to search a patent database for the term “pizza”. It is also an area of patent activity that encompasses a wide range of technologies, such as pizza ovens, pizza boxes, pizza cutters and pizza toppings. It is therefore useful for demonstrating ways of interrogating patent data for particular topics.

pizza_small is a very small 26 row dataset created by downloading the first page of results from the European Patent Office espacenet database for a smart search on “pizza”. It’s a quick and easy test dataset.
pizza_medium was created from a sample of data from a search of the WIPO Patentscope database for the term “pizza” and contains 9,996 rows of data. It is intended to illustrate the data format from Patentscope and to allow work on a medium-sized dataset. Note that the format differs from the espacenet format and presents different challenges. An important feature of Patentscope data from a statistical standpoint is that the field marked publication_number in the original data lacks a kind code (the short suffix, such as A1 or B1, that identifies the type of publication) and is therefore better treated as an application number.
The pizza_medium_clean dataset is a precleaned version of the pizza_medium dataset. Specifically, the applicants and inventors fields have already been cleaned, corrupted characters have been repaired and other common cleaning tasks have been carried out. This makes the data easier to work with, and this dataset is the core dataset in the Manual. As above, note that the Patentscope publication_number field more properly refers to an application number in the absence of a kind code.
pizza_sliced is a set of five .csv files from a search for pizza on espacenet. It is designed to illustrate issues involved in loading multiple files into R. It also illustrates problems with character corruption and the importance of pre-cleaning data before analysis.
pizza_lens_1000 is a raw dataset of 1000 records including the term pizza downloaded from The Lens database. The dataset has not been cleaned.
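
Loading a set of files such as pizza_sliced usually means reading each file in turn and stacking the rows into one table. The dataset is designed for practising this in R; what follows is only a minimal Python sketch of the same idea using the standard library. The file names, column names and records are hypothetical stand-ins, not the contents of the real pizza_sliced files, and in-memory text is used in place of files on disk.

```python
import csv
import io


def read_slices(sources):
    """Read several CSV sources that share a header into one list of rows.

    `sources` is a list of (name, file-like object) pairs. Each row is a
    dict keyed by column name, with the source name added so that rows
    can be traced back to the file they came from.
    """
    rows = []
    for name, handle in sources:
        for row in csv.DictReader(handle):
            row["source_file"] = name
            rows.append(row)
    return rows


# Hypothetical stand-ins for two of the five files (made-up records).
slice_1 = io.StringIO("publication_number,title\nXX0000001A1,Pizza oven\n")
slice_2 = io.StringIO("publication_number,title\nXX0000002B1,Pizza box\n")

combined = read_slices([("slice_1.csv", slice_1), ("slice_2.csv", slice_2)])
print(len(combined))
```

With real files, each handle would typically come from open(path, newline="", encoding="utf-8"). Reading a file with the wrong encoding is one common cause of corrupted characters, so it is worth checking the encoding before any cleaning begins.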

4.1.2 Patent Landscape Reports datasets

Three datasets are drawn from the WIPO Patent Landscape Reports. The datasets address different topics, present a variety of fields and formats and are different sizes. Each dataset is linked to a detailed patent landscape report that provides an insight into approaches to patent analytics.

ewaste presents the results of research for a report on patent activity for electronic waste recycling and its implications for developing countries.
solar_cooking presents the data supporting a report on technologies that use solar energy as the source for cooking and pasteurizing food.
ritonavir presents the data for a report on patent activity for the HIV antiretroviral drug Ritonavir in the field of pharmaceuticals. The dataset illustrates specific activity around issues such as dosage and also the problem of ‘evergreening’ in patent activity.

4.1.3 Other datasets

wipo is a single Excel sheet of data on trends in patent applications and growth rates from the WIPO World Intellectual Property Indicators - 2014 Edition. The data is used for simple graphing in tools such as R and illustrates the need to skip rows when reading data into analytics tools.
WIPO_sequence_data. This dataset contains a small sample of the sequence data from the year 2000 available free of charge from the WIPO Patentscope database. This dataset can be used to explore analysis of patent sequence data.
Synthetic biology. This is a sample of data from Thomson Innovation developed by Paul Oldham for research on patent activity involving synthetic biology. The data has been extensively cleaned in VantagePoint from Search Technology Inc. and is intended to illustrate the use of data from a commercial patent database.
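
The need to skip rows, noted above for the wipo sheet, typically arises when a spreadsheet places titles or notes above the real header row. The following is a minimal Python sketch, assuming the sheet has first been exported to CSV. The number of rows to skip, the column names and all values are invented for illustration and are not taken from the wipo dataset.

```python
import csv
import io


def read_skipping(handle, skip):
    """Discard the first `skip` lines, then parse the rest as CSV.

    The first line that is not skipped is treated as the header row.
    """
    for _ in range(skip):
        next(handle, None)  # ignore title or note lines above the header
    return list(csv.DictReader(handle))


# Hypothetical export: two descriptive lines sit above the header row.
sheet = io.StringIO(
    "Example title line\n"
    "Example source note\n"
    "year,applications\n"
    "2012,100\n"
    "2013,110\n"
)

records = read_skipping(sheet, skip=2)
print(records[0]["year"], records[0]["applications"])
```

Inspecting the first few lines of a file before loading it is usually the quickest way to decide how many rows to skip.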

4.1.4 Round Up

The datasets section of the project provides a series of useful training sets drawn from a variety of sources and displaying a variety of features. These are open access datasets that can be used to test different approaches. Please credit their sources when you use them. More datasets may be added to the online version of the Manual in due course. We are particularly interested in sample data from STN, QuestelOrbit, PATSTAT or other data providers that can be used as training sets.