Chapter 2 Introduction
This book provides a practical guide to free and open source software tools for patent analytics. The aim of the WIPO Manual on Open Source Patent Analytics is to provide a practical introduction to patent analytics without assuming prior knowledge of patents or programming languages.
Open source and free software tools are a fast-moving area. In response to this, the Manual is published in two versions:
- The electronic version of the Manual which can be updated as tools are updated.
- A printed reference Manual providing a guide to core tools.
The Manual builds on the experience generated in the development of the WIPO Patent Landscapes on a wide range of topics that serve as key reference works for methods in patent analytics. The Manual is mainly intended for researchers, patent professionals and patent offices in developing countries. However, we expect that it will be of wider interest to researchers and patent professionals.
Patent data is important because it is a valuable source of technical information that can inform decision-making on whether or not to pursue a particular avenue of research and development, whether to license a particular technology, or whether to pursue product development in particular markets. Patent data is also important in economic and policy terms because it provides a key indicator and insight into trends in science and technology. Patent data is commonly used by organisations such as the OECD, EUROSTAT and others to report on trends in research and development. Researchers increasingly use patent data to investigate new and emerging areas of science and technology such as genome editing or climate change adaptation technologies.
Patent activity can also be controversial. Important controversies over the last 20 years include DNA patents, software patents, patents on business methods, the rise of patent ‘trolls’ and the implications of the internationalisation of patent activity for developing countries. The free software and open source movements (based on the flexibilities in copyright law) are in part a response to the controversies that have arisen around proprietary software models involving copyright and patents and a desire to do things differently. This has led to new models for sharing data, cooperation in innovation and new business models. In particular, a wide range of open source and free software tools are now available for research and analysis. This Manual provides an overview of the available tools for patent analysis and explores a small number in greater depth.
We will focus on answering two main questions:
- How to obtain patent data in a form that is useful for different types of analysis?
- How to tidy, analyse, visualize and share patent data using open source and free software?
In approaching these issues we will organise the Manual and materials into five main topics:
- An Overview of Open Source and Free Software Tools
- Approaching Patent Data
- Obtaining Patent Data
- Cleaning and Tidying Patent Data
- Analysing and Visualizing Patent Data
As a project focusing on open source and free tools, all data and tools developed for the Manual are made available through the GitHub project repository. We encourage you to take a look at the repository. To get started with GitHub and download all materials from the Manual, install Git (or a GitHub desktop client) and then clone the repository. It’s actually much easier than it sounds.
We will now take a quick look at the background to the topics.
2.1.1 An Overview of Open Source and Free Software Tools
We start the Manual with a core Overview chapter that reviews the ever-growing number of open source and free software tools that are available for different steps in the patent analytics process. The sheer number of relevant tools is almost overwhelming, and one feature of open source tools is that they all require investments of valuable time to learn how they work. In some cases this may require acquiring programming skills. To assist with decision-making on whether or not to invest in a particular tool, we conclude the Overview with a list of 12 questions that you may want to consider. By far the most important of these questions, and the guiding principle informing our selection of tools for the Manual, is: Does this work for me?
2.1.2 Approaching Patent Data
In preparing the Manual we assumed no prior knowledge of the patent system or open source tools. To help you get started a chapter on patent data fields provides a brief introduction to the structure of patent documents and the main data fields that are used in patent analytics.
2.1.3 Obtaining Patent Data
One major challenge in understanding the implications of patent activity, whether in climate change technologies, software, pharmaceuticals or other fields, is accessing and understanding patent data.
Recent years have witnessed a major shift towards the use of open source research tools and the promotion of open access to scientific data along with the promotion of open science. One of the main purposes of the patent system is to make information on inventions available for wider public use. The patent system has responded to this through the creation of publicly accessible databases such as the European Patent Office Espacenet database, which contains millions of patent records from over 90 countries and organisations. WIPO Patentscope provides access to 52 million patent documents and weekly publications of Patent Cooperation Treaty applications. Other initiatives to make patent data available include Google Patents, The Lens and Free Patents Online. Most of these tools do not require knowledge of programming. However, the European Patent Office Open Patent Services provides free access to raw patent data for those willing to work with an Application Programming Interface (API) and to parse raw XML or JSON data.
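As a rough illustration of what parsing raw JSON data can involve, the following Python sketch flattens one small, made-up record into a single row of fields. The record structure and the field names (publication_number, applicants, ipc and so on) are assumptions chosen for illustration. They do not reproduce the response format of Open Patent Services or any other service, which should be checked against that service's own documentation.

```python
import json

# A made-up JSON record (assumption): real services use their own,
# usually more deeply nested, schemas.
RAW = """
{
  "publication_number": "EP0000000A1",
  "publication_date": "2015-03-04",
  "title": "Example invention",
  "applicants": [{"name": "EXAMPLE CORP"}, {"name": "SAMPLE UNIVERSITY"}],
  "ipc": ["C12N 15/00", "A61K 38/00"]
}
"""


def flatten_record(record):
    """Turn one nested record into a flat dict suitable for a spreadsheet row."""
    return {
        "publication_number": record.get("publication_number", ""),
        # Keep only the year part of an ISO date such as 2015-03-04.
        "publication_year": record.get("publication_date", "")[:4],
        "title": record.get("title", ""),
        # Join repeated values with a semicolon so they fit into one cell.
        "applicants": "; ".join(a["name"] for a in record.get("applicants", [])),
        "ipc": "; ".join(record.get("ipc", [])),
    }


row = flatten_record(json.loads(RAW))
print(row)
```

Joining repeated values with a semicolon is only one possible choice. It keeps one row per document, which suits spreadsheets, at the cost of needing to separate the values again before counting applicants or classification codes.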
In the case of the United States it is possible to bulk download the entire USPTO collection through the Google Bulk Download of USPTO patents. The USPTO has also recently embraced open data through the creation of a new data portal and the PatentsView search database and JSON API. A range of commercial providers such as Thomson Innovation and PatBase, among others, provide access to patent data and, in the case of Thomson Innovation, add additional information through the Derwent World Patent Index. As such, there is an ecosystem of patent information sources and providers.
As we will see, the key problem confronting patent analysts using free tools is obtaining patent data in the quantity and with the coverage needed, and with the desired fields for analytics purposes. The Manual will walk through the different information services and go into detail on those free services that are the most useful for patent analytics.
2.1.4 Cleaning and Tidying Patent Data
Anyone familiar with working with data will know that the majority of the work is taken up with cleaning data prior to analysis. In particular, data from different patent databases typically presents different cleaning challenges. Most of these challenges involve cleaning inventor and applicant names or cleaning text fields prior to analysis.
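To give a sense of what cleaning applicant names involves, here is a minimal Python sketch that groups spelling variants under a simplified key. The list of legal-form suffixes and the sample names are hypothetical, and the rule used (upper-casing, removing punctuation and dropping suffixes) is only one simple approach. It is not the method used by any particular tool.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form suffixes to ignore; a real project
# would need a longer list adapted to its data.
LEGAL_SUFFIXES = {"INC", "LTD", "LLC", "CORP", "CO", "GMBH", "AG", "SA"}


def normalise_name(name):
    """Return a simplified key used to group variants of the same name."""
    key = name.upper()
    key = re.sub(r"[^\w\s]", " ", key)  # replace punctuation with spaces
    words = [w for w in key.split() if w not in LEGAL_SUFFIXES]
    return " ".join(words)


def group_variants(names):
    """Map each simplified key to the original names that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise_name(name)].append(name)
    return dict(groups)


names = ["Example Corp.", "EXAMPLE CORP", "Example, Inc.", "Sample University"]
groups = group_variants(names)
for key, variants in groups.items():
    print(key, "<-", variants)
```

Automatic grouping of this kind can merge names that belong to different organisations, so the resulting groups should be treated as candidates for manual review rather than as final results.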
Two core chapters in the Manual address data cleaning issues. The first is a chapter on OpenRefine (formerly Google Refine), which walks through the process for cleaning applicant and inventor names for a sample dataset. The second chapter focuses on the use of R to tidy patent data for an infographic.
In working with the Manual we suggest that you might find the following resources useful. The first addresses the question of how best to prepare for work in analytics and the second addresses key issues in the formatting of data that informs work in the Manual using R and RStudio.
- Jeff Leek’s The Elements of Data Analytic Style (available free of charge if required)
- Hadley Wickham on Tidy Data and this video
We suggest that you take a look at these resources because they contain core ideas for effective approaches to working with patent data.
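A central idea in tidy data is that each variable forms a column and each observation forms a row. Patent data often breaks this rule by packing several applicants or inventors into one cell. The following Python sketch, using made-up publication numbers and names and an assumed semicolon separator, shows one way to separate such a field into one row per applicant.

```python
# Hypothetical untidy rows: several applicants are packed into one cell.
untidy = [
    {"publication_number": "XX0000001", "applicants": "EXAMPLE CORP; SAMPLE UNIVERSITY"},
    {"publication_number": "XX0000002", "applicants": "EXAMPLE CORP"},
]


def separate_rows(rows, column, sep=";"):
    """Return a new list with one row per value found in `column`."""
    tidy = []
    for row in rows:
        for value in row[column].split(sep):
            value = value.strip()
            if value:  # skip empty fragments such as a trailing separator
                new_row = dict(row)  # copy so the input rows are not modified
                new_row[column] = value
                tidy.append(new_row)
    return tidy


tidy = separate_rows(untidy, "applicants")
for row in tidy:
    print(row["publication_number"], row["applicants"])
```

After this step the publication number repeats across rows, which is expected: each row now describes one document and applicant pair, and applicants can be counted directly.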
2.1.5 Analysing and Visualizing Patent Data
The core questions in patent analysis are: who, what, where, when, how, and with what? The way in which we approach these questions will depend on the goal of the patent analysis. However, in almost all circumstances realising that goal will depend on combinations of answers to the core questions. The visualization of patent data is an essential feature of modern patent analysis. Put simply, humans are better at absorbing visual information than columns and rows of numbers or large numbers of texts.
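In practice, several of the core questions come down to counting records by a field and then combining the counts. As a minimal sketch, assuming a small set of made-up records with hypothetical applicant, country and year fields, the "who", "where" and "when" questions could be approached as follows.

```python
from collections import Counter

# Made-up records for illustration only.
records = [
    {"applicant": "EXAMPLE CORP", "country": "US", "year": 2012},
    {"applicant": "EXAMPLE CORP", "country": "EP", "year": 2013},
    {"applicant": "SAMPLE UNIVERSITY", "country": "US", "year": 2013},
    {"applicant": "EXAMPLE CORP", "country": "US", "year": 2014},
]

# Who: number of records per applicant.
by_applicant = Counter(r["applicant"] for r in records)
# Where: number of records per publication country.
by_country = Counter(r["country"] for r in records)
# When: number of records per year, sorted to give a simple trend.
trend = sorted(Counter(r["year"] for r in records).items())

print(by_applicant.most_common(1))
print(by_country)
print(trend)
```

Counts like these are the usual starting point for the charts and dashboards discussed in the visualization chapters.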
Two core chapters in the Manual address the visualization of patent data using dashboards with Tableau Public and interactive graphics using Plotly with Excel files or using RStudio. The visualization of networks of applicants, inventors or technologies is a growing feature of patent analytics and we provide a practical walkthrough using the open source software Gephi. With the growing popularity of infographics a core chapter is also provided on preparing data for an infographic using RStudio and the online infographic service infogr.am.
Looking beyond patent analysis and visualization, within the core Manual we include a chapter on how RStudio can be used to access the scientific literature using the rplos package, developed by rOpenSci for accessing the Public Library of Science, as an introduction to accessing the wider scientific literature using packages such as