Patent Database Quality
What is data quality?
Wikipedia says “data is generally considered high quality if it is fit for its intended uses in operations, decision making and planning.”
Patent analysis tools are meant to serve uses ranging from the very high-level, such as deriving insights from large groups of relevant patents, to the very detailed. All tools start with the same sources of data: the patent offices worldwide. But that data contains millions of errors, and it is inconsistent and often incomplete.
For example, in the USPTO data there are over 100 different ways that International Business Machines is spelled in patent grants – and applications have many more misspellings. This doesn’t count the over 1,000 additional variations that also include the name of the IBM subsidiary or inventor in the assignee field.
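To make the spelling problem concrete, here is a minimal sketch of how assignee name variants could be collapsed to one canonical name. It is not Innography's method: the suffix list, the alias table, the function names, and the 0.9 similarity cutoff are all assumptions chosen for illustration.

```python
import re
import unicodedata
from difflib import get_close_matches
from typing import Optional

# Hypothetical legal-form suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "incorporated",
                  "co", "company", "ltd", "llc"}

# Hypothetical alias table: comparison key -> canonical company name.
CANONICAL = {
    "international business machines": "International Business Machines",
    "ibm": "International Business Machines",
}


def name_key(raw: str) -> str:
    """Reduce a raw assignee string to a lowercase comparison key."""
    # Strip accents, then drop periods so "I.B.M." becomes "ibm".
    text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    text = text.lower().replace(".", "")
    # Turn any remaining punctuation into spaces.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    tokens = [t for t in text.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def canonical_name(raw: str) -> Optional[str]:
    """Return the canonical name for a raw assignee string, or None."""
    key = name_key(raw)
    if key in CANONICAL:
        return CANONICAL[key]
    # Fall back to fuzzy matching to catch simple misspellings (assumed cutoff).
    close = get_close_matches(key, list(CANONICAL), n=1, cutoff=0.9)
    return CANONICAL[close[0]] if close else None


print(canonical_name("I.B.M. Corp"))
print(canonical_name("Internatinal Business Machines Corporation"))
```

Both calls print the same canonical name. A hand-built alias table like this only covers the variants someone has already found, which is why searching for misspellings one analysis at a time is so unreliable.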
Data quality is the foundation of efficient analysis, valid insights, and informed decisions. Innography’s investment in data quality has generated an industry-leading set of practices to ensure the highest data quality available:
- Correction: Tens of millions of data elements are fixed, such as company name misspellings (see sample IBM misspellings in the sidebar) and data in the wrong fields. This saves analysts hours of time, because they no longer have to search for misspellings in each analysis and hope they found them all.
- Normalization: Company subsidiaries often own patents, and Innography recognizes the company hierarchy for over 100,000 companies globally. When examining another company’s portfolio, it’s critical to be able to see their entire portfolio right away, which is what Innography provides.
- Correlation: Many other data sources are incorporated into the Innography solution, and are automatically correlated with patents – such as litigation, trademarks, and company financials. So if you’re looking at a patent set and you want to know which of these patents have been litigated, it’s one click away.
- Completion: Innography “fills” empty data elements where possible. For example, CPC codes weren’t assigned to US patents until 2011, so we add the appropriate CPC code based on the other classifications (IPC and UPC). Also, many patent applications don’t have the company assignee in the record, so Innography infers the company owner where possible based on the inventors and their past employers – over a half-million applications are algorithmically assigned this way.
- Calculation: Some data elements simply don’t exist in the patent record, such as expiration date. Innography calculates the expiration date based on the country’s rules and exceptions, such as (for the US) patent term adjustments, terminal disclaimers, and congressional patent extensions.
- Updates: All this data is automatically updated at least weekly, incorporating new transaction data such as re-assignments, legal status updates, and company acquisitions – over 200,000 updates are applied each month.
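As an illustration of the Calculation step, the sketch below estimates a US expiration date under deliberately simplified assumptions: a 20-year term from the earliest effective filing date, extended by patent term adjustment days and capped by a terminal disclaimer date when one exists. The class and field names are hypothetical, and the sketch ignores congressional extensions, maintenance-fee lapses, and the different rules for older patents, so it is not a description of Innography's calculation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class UsPatent:
    """Hypothetical minimal record; field names are illustrative."""
    filing_date: date                       # earliest effective non-provisional filing date
    pta_days: int = 0                       # patent term adjustment, in days
    terminal_disclaimer_date: Optional[date] = None


def add_years(d: date, years: int) -> date:
    """Add whole years, mapping Feb 29 to Feb 28 when the target year is not a leap year."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def estimated_expiration(p: UsPatent) -> date:
    """Simplified estimate: filing date + 20 years + PTA, capped by any terminal disclaimer."""
    expiry = add_years(p.filing_date, 20) + timedelta(days=p.pta_days)
    if p.terminal_disclaimer_date is not None:
        expiry = min(expiry, p.terminal_disclaimer_date)
    return expiry


patent = UsPatent(filing_date=date(2005, 3, 15), pta_days=100)
print(estimated_expiration(patent))  # 2025-06-23
```

Even this reduced version needs three inputs that are not stored together in the raw patent record, which is why the expiration date has to be calculated rather than looked up.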
New data and all data changes must pass through all of the above steps before being accepted into the Innography “clean room” database, as shown in this graphic. Innography does all of these steps to help practitioners create the most reliable analyses possible.
How does Innography do this?
Innography’s data-cleansing capabilities, built over the last 8 years on a modern architecture with the latest Big Data technologies, are unique in the industry.
The InnSight rules engine has over 10 million data-correction rules that are applied to every update, and uses machine learning to continually improve the resulting data quality. In addition, Innography uses crowd-sourcing to find exceptions that only a person can decipher and to generate new rules from them, in a “virtuous cycle” that increases accuracy as more people use it.
All this is overseen by our team of data scientists, who are charged with creating the cleanest data sets possible, while constantly adding new data sources and updating their technology toolsets.
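The internals of the InnSight rules engine are not described here, so the following is only a toy sketch of the general idea of applying data-correction rules to every incoming record. The `Rule` structure, the field names, and the two sample rules are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical correction rule: a regex substitution on one field."""
    field: str
    pattern: str
    replacement: str


def apply_rules(record: Dict[str, str], rules: List[Rule]) -> Dict[str, str]:
    """Return a corrected copy of the record; the input record is left unchanged."""
    cleaned = dict(record)
    for rule in rules:
        value = cleaned.get(rule.field)
        if value is not None:
            cleaned[rule.field] = re.sub(
                rule.pattern, rule.replacement, value, flags=re.IGNORECASE
            )
    return cleaned


# Illustrative rules only: fix one known misspelling, then collapse repeated whitespace.
RULES = [
    Rule("assignee", r"\bInternatinal\b", "International"),
    Rule("assignee", r"\s+", " "),
]

raw = {"assignee": "Internatinal  Business Machines", "title": "Example title"}
print(apply_rules(raw, RULES)["assignee"])  # International Business Machines
```

In a sketch like this, a newly discovered exception becomes one more entry in the rule list, which is then applied to every later update.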
Why does this matter?
Many patent analyses start with the question, “What patents are in this company’s portfolio?” In other cases, a set of relevant patents is grouped by patent owner to see which companies are active in a certain technology area.
Without the data cleansing described above, answering this simple question can require hours or days of data cleanup: fixing company misspellings, rolling up subsidiaries, applying re-assignment updates, and determining which patents and applications have expired. Even then, the analyst can never be sure that every exception was caught.
With Innography, analysts can spend time on the actual analysis rather than cleaning up data. Many new users state that they can perform analyses many times faster with Innography, and are even more confident in the end results.
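For a sense of what the subsidiary roll-up behind the portfolio question involves, here is a minimal sketch that counts patents by ultimate parent company. The company names and the `PARENT_OF` table are invented for illustration, and the sketch assumes assignee names have already been corrected to a canonical spelling.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical subsidiary -> parent links; all names are illustrative only.
PARENT_OF = {
    "Example Labs": "Example Holdings",
    "Example Holdings": "Example Group",
}


def ultimate_parent(company: str, parent_of: Dict[str, str]) -> str:
    """Follow parent links to the top of the hierarchy, guarding against cycles."""
    seen = set()
    while company in parent_of and company not in seen:
        seen.add(company)
        company = parent_of[company]
    return company


def portfolio_sizes(assignees: Iterable[str], parent_of: Dict[str, str]) -> Counter:
    """Count patents per ultimate parent, given one assignee name per patent."""
    return Counter(ultimate_parent(a, parent_of) for a in assignees)


assignees = ["Example Labs", "Example Holdings", "Other Co", "Example Group"]
print(portfolio_sizes(assignees, PARENT_OF))
```

Here three of the four patents roll up to "Example Group". The counting is trivial; the work lies in building and maintaining an accurate hierarchy table as companies are acquired and patents are re-assigned.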