Chapter 1 Introduction

This book provides a practical guide to free and open source software tools for patent analytics. The aim of the WIPO Manual on Open Source Patent Analytics is to introduce patent analytics without assuming prior knowledge of patents or programming languages.

One feature of open source and free software tools is that the field is fast moving. In response, the Manual is available in two versions:

  • An electronic version of the Manual, which can be updated as the tools themselves are updated.
  • A printed reference Manual providing a guide to the core tools.

The Manual builds on the experience gained in developing the WIPO Patent Landscape Reports, which cover a wide range of topics and serve as key reference works for methods in patent analytics. The Manual is mainly intended for researchers, patent professionals and patent offices in developing countries. However, we expect that it will be of wider interest to researchers and patent professionals elsewhere.

Patent data is important because it is a valuable source of technical information that can inform decisions on whether to pursue a particular avenue of research and development, whether to license a particular technology, or whether to pursue product development in particular markets. Patent data is also important in economic and policy terms because it provides key indicators of, and insights into, trends in science and technology. Patent data is commonly used by organisations such as the OECD and EUROSTAT to report on trends in research and development. Researchers increasingly use patent data to investigate new and emerging areas of science and technology such as genome editing or climate change adaptation technologies.

Patent activity can also be controversial. Important controversies over the last 20 years include DNA patents, software patents, patents on business methods, the rise of patent ‘trolls’ and the implications of the internationalisation of patent activity for developing countries. The free software and open source movements (based on the flexibilities in copyright law) are in part a response to the controversies that have arisen around proprietary software models involving copyright and patents, and reflect a desire to do things differently. This has led to new models for sharing data, cooperating in innovation and doing business. In particular, a wide range of open source and free software tools are now available for research and analysis. This Manual provides an overview of the available tools for patent analysis and explores a small number in greater depth.

1.1 Structure

We will focus on answering two main questions:

  • How can we obtain patent data in a form that is useful for different types of analysis?
  • How can we tidy, analyse, visualize and share patent data using open source and free software?

In approaching these issues we will organise the Manual and materials into five main topics:

  1. An Overview of Open Source and Free Software Tools
  2. Approaching Patent Data
  3. Obtaining Patent Data
  4. Cleaning and Tidying Patent Data
  5. Analysing and Visualizing Patent Data

Because this is a project focused on open source and free tools, all of the data and tools developed for the Manual are made available through the GitHub project repository. We encourage you to take a look at the repository. To get started with GitHub and download all materials from the Manual, install GitHub and then clone the repository. It’s actually much easier than it sounds.
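
For readers working in R, the repository can also be cloned from inside the R console. The sketch below is a minimal illustration, assuming the git2r package is installed; the repository URL shown is a placeholder and should be replaced with the address of the project repository on GitHub.

    # a minimal sketch, assuming the git2r package is installed
    # install.packages("git2r")
    library(git2r)

    # placeholder URL - replace with the address of the project repository
    repo_url <- "https://github.com/your-account/opensource-patent-analytics.git"

    # clone the repository into a local folder of your choice
    clone(repo_url, local_path = "~/patent-analytics-manual")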

We will now take a quick look at the background to the topics.

1.1.1 An Overview of Open Source and Free Software Tools

We start the Manual with a core Overview chapter that reviews the ever-growing number of open source and free software tools available for the different steps in the patent analytics process. The sheer number of relevant tools can be overwhelming, and each requires an investment of valuable time to learn how it works. In some cases this may mean acquiring programming skills. To assist with deciding whether or not to invest in a particular tool, we conclude the Overview with a list of 12 questions that you may want to consider. By far the most important of these questions, and the guiding principle informing our selection of tools for the Manual, is: Does this work for me?

1.1.2 Approaching Patent Data

In preparing the Manual we assumed no prior knowledge of the patent system or open source tools. To help you get started, a chapter on patent data fields provides a brief introduction to the structure of patent documents and the main data fields used in patent analytics.

1.1.3 Obtaining Patent Data

One major challenge in assessing the implications of patent activity, whether in fields such as climate change technologies, software or pharmaceuticals, is accessing and understanding patent data.

Recent years have witnessed a major shift towards the use of open source research tools and the promotion of open access to scientific data and open science. One of the main purposes of the patent system is to make information on inventions available for wider public use. The patent system has responded to this through the creation of publicly accessible databases, such as the European Patent Office Espacenet database containing millions of patent records from over 90 countries and organisations. WIPO Patentscope provides access to 52 million patent documents and weekly publications of Patent Cooperation Treaty applications. Other initiatives to make patent data available include Google Patents, The Lens and Free Patents Online. Most of these tools do not require knowledge of programming. However, the European Patent Office Open Patent Services (OPS) provides free access to raw patent data for those willing to work with an Application Programming Interface (API) and to parse raw XML or JSON data.
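
To give a flavour of what working with an API involves, the following is a minimal sketch in R using the httr package to send a simple search request to the OPS published-data service and parse the JSON response. It assumes that you have registered with OPS and obtained an access token; the endpoint and query syntax shown follow the OPS documentation at the time of writing and may change.

    # a minimal sketch, assuming an OPS access token and the httr package
    library(httr)

    access_token <- "YOUR_OPS_ACCESS_TOKEN"  # obtained after registering with OPS

    # search the published-data service for titles containing the word "drone"
    response <- GET(
      "https://ops.epo.org/3.2/rest-services/published-data/search",
      query = list(q = 'ti="drone"'),
      add_headers(
        Authorization = paste("Bearer", access_token),
        Accept = "application/json"
      )
    )

    # parse the JSON response into an R list for further processing
    results <- content(response, as = "parsed")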

In the case of the United States it is possible to bulk download the entire USPTO collection through the Google Bulk Download service for USPTO patents. The USPTO has also recently embraced open data through the creation of a new data portal and the PatentsView search database and JSON API. A range of commercial providers, such as Thomson Innovation and PatBase, offer access to patent data and, in the case of Thomson Innovation, add further information through the Derwent World Patent Index. In short, there is a rich ecosystem of patent information sources and providers.
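
As an illustration of what a JSON API query looks like in practice, the sketch below sends a simple request to the PatentsView API from R using the httr and jsonlite packages. The endpoint and query syntax reflect the PatentsView documentation at the time of writing and may since have changed, so treat this as an illustration rather than a definitive recipe.

    # a minimal sketch of a PatentsView API query, assuming httr and jsonlite
    library(httr)
    library(jsonlite)

    # ask for patents whose titles mention "solar", returning three fields
    query  <- '{"_text_any":{"patent_title":"solar"}}'
    fields <- '["patent_number","patent_title","patent_date"]'

    response <- GET(
      "http://www.patentsview.org/api/patents/query",
      query = list(q = query, f = fields)
    )

    # convert the JSON response into a data frame of matching patents
    patents <- fromJSON(content(response, as = "text"))$patents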

As we will see, the key problem confronting patent analysts using free tools is obtaining patent data in the quantity, with the coverage and with the fields needed for analysis. The Manual will walk through the different information services and go into detail on the free services that are most useful for patent analytics.

1.1.4 Cleaning and Tidying Patent Data

Anyone familiar with working with data will know that the majority of the work is taken up with cleaning data prior to analysis. In particular, data from different patent databases typically presents different cleaning challenges. Most of these challenges involve cleaning inventor and applicant names or cleaning text fields prior to analysis.

Two core chapters in the Manual address data cleaning issues. The first is a chapter on OpenRefine (formerly Google Refine), which walks through the process of cleaning applicant and inventor names in a sample dataset. The second chapter focuses on the use of R to tidy patent data for an infographic.
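
To give a small taste of what name cleaning involves, the sketch below applies a few common normalisation steps to a vector of invented applicant names in base R. It is a generic illustration rather than the workflow used in the OpenRefine chapter.

    # a minimal sketch of applicant name cleaning in base R (invented examples)
    applicants <- c("ACME Corp.", "Acme Corporation", "ACME  CORP")

    clean <- toupper(applicants)                    # harmonise case
    clean <- gsub("[[:punct:]]", "", clean)         # strip punctuation
    clean <- gsub("\\s+", " ", clean)               # collapse repeated spaces
    clean <- trimws(clean)                          # trim leading/trailing spaces
    clean <- gsub(" CORPORATION$", " CORP", clean)  # harmonise a common suffix

    unique(clean)
    # "ACME CORP"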

In working with the Manual you may find the following resources useful. The first addresses the question of how best to prepare for work in analytics, and the second addresses key issues in data formatting that inform the work in the Manual using R and RStudio.

  1. Jeff Leek’s The Elements of Data Analytic Style (available free of charge if required)
  2. Hadley Wickham on Tidy Data and this video

We suggest that you take a look at these papers because they contain core ideas for effective approaches to working with patent data.
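
For a flavour of what tidy data means in practice, the sketch below reshapes a small hypothetical table of patent filings, with one column per year, into a tidy form with one row per applicant and year using the tidyr package.

    # a minimal sketch of tidying a wide table with tidyr (hypothetical data)
    library(tidyr)

    wide <- data.frame(
      applicant = c("Applicant A", "Applicant B"),
      `2013` = c(10, 4),
      `2014` = c(12, 7),
      check.names = FALSE
    )

    # gather the year columns into key-value pairs: one row per applicant-year
    tidy <- gather(wide, key = "year", value = "filings", -applicant)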

1.1.5 Analysing and Visualizing Patent Data

The core questions in patent analysis are: who, what, where, when, how, and with what? The way in which we approach these questions will depend on the goal of the patent analysis. However, in almost all circumstances realising that goal will depend on combining answers to the core questions. The visualization of patent data is an essential feature of modern patent analysis. Put simply, humans are better at absorbing visual information than columns and rows of numbers or large volumes of text.

Two core chapters in the Manual address the visualization of patent data, using dashboards in Tableau Public and interactive graphics in Plotly, working from Excel files or from RStudio. The visualization of networks of applicants, inventors or technologies is a growing feature of patent analytics, and we provide a practical walkthrough using the open source software Gephi. With the growing popularity of infographics, a core chapter is also provided on preparing data for an infographic using RStudio and the online infographic service infogr.am.
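
To give an idea of how little code an interactive graphic can require, the sketch below uses the plotly package in R to turn a small hypothetical table of filings per year into an interactive bar chart.

    # a minimal sketch of an interactive bar chart with plotly (hypothetical data)
    library(plotly)

    filings <- data.frame(
      year  = 2010:2014,
      count = c(120, 135, 150, 160, 180)
    )

    # an interactive bar chart that can be explored in the browser or exported
    plot_ly(filings, x = ~year, y = ~count, type = "bar")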

Looking beyond patent analysis and visualization, the core Manual includes a chapter on how RStudio can be used to access the scientific literature using packages developed by rOpenSci, beginning with rplos for the Public Library of Science as an introduction to accessing the wider scientific literature through packages such as fulltext.
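
The sketch below gives a flavour of that chapter, using rplos to search the Public Library of Science from R; the function and arguments shown follow the rplos documentation at the time of writing and may have changed.

    # a minimal sketch of searching the Public Library of Science with rplos
    library(rplos)

    # search for articles mentioning "synthetic biology", returning three fields
    results <- searchplos(
      q     = "synthetic biology",
      fl    = "id,title,publication_date",
      limit = 10
    )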

1.1.6 Sharing Data and The Writing of the Manual

In writing this Manual we decided at an early stage to use free and open source tools. Our tool of choice was RStudio because it allowed us to write the Manual in markdown (rmarkdown), including images and graphs generated from code, and then easily export the results to Word, PDF and HTML. We were also able to easily create a home for the Manual on GitHub and to use Jekyll to release earlier versions of chapters as articles as they were written. As we moved the Manual into its final version we were able to take advantage of the new bookdown package, within the mid-2016 preview version of RStudio, to turn the Manual into the electronic book you are reading. All of this was free. The only requirement was an investment of time in acquiring the knowledge to use the tools.
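
To give an idea of how lightweight this workflow is, the sketch below renders a single chapter written in rmarkdown to Word and HTML from the R console; the file names are placeholders.

    # a minimal sketch of rendering an rmarkdown chapter (file names are placeholders)
    library(rmarkdown)

    # export the same source document to different formats
    render("chapter1.Rmd", output_format = "word_document")
    render("chapter1.Rmd", output_format = "html_document")

    # the whole book can then be assembled with bookdown
    # bookdown::render_book("index.Rmd")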

A key aim behind the development of the Manual was also to make a range of real patent datasets available that readers could use to experiment with the different tools and follow the Manual as a practical guide. GitHub proved to be ideal for this, particularly with the introduction of large file storage. While these tools were initially unfamiliar, and involved a learning curve, the process proved so straightforward that the entire Manual was written in rmarkdown inside RStudio and posted to the project development website on GitHub as it was written.

This combination of tools proved to be a powerful and highly flexible means of sharing raw data, results and analysis in a way that is transparent and easily accessible to a range of audiences. Furthermore, all of the tools are free. While this approach will not suit situations where confidentiality is a key concern, for projects where the results are intended to be public this combination of tools represents a refreshing solution to the old problem of how to make the results of research available to the widest possible audience for free.