Machine Learning: How Innography Drastically Improves Big Data Quality to Give Its Customers an Edge
In early November, Tyron and I had the pleasure of presenting at the American Society for Quality (Austin Section) event, “The Future of Quality.” ASQ is an organization founded to advocate for quality. Not specific to any field, discipline, or level within an organization, ASQ emphasizes an all-encompassing dedication to professional integrity and a passion for progress and excellence. As a company so committed to quality ourselves, it was inspiring to be in the presence of such a like-minded group.
Delivering a Compelling Topic for the End of Day
The day-long event covered six presentations ranging in expertise and topics from healthcare, semiconductors, education, and product development to IP data software. It was interesting to see how quality assurance is addressed within the different fields, and how the skeleton of the process isn’t all that different, even between vastly different fields (e.g., healthcare vs. IP data software). Innography had the last slot, at 3:30 pm, and given that it was a Friday and the topic was “quality,” it was going to be a challenge to keep an audience of about 40 engaged, or so I thought.
Tyron started his presentation, titled “Machine Learning: How Innography Drastically Improves Big Data Quality to Give Its Customers an Edge,” by providing an overview of what Innography does, explaining the quality challenges it faces, and conveying how large a role taming big data plays.
His treasure trove of anecdotes and his ability to explain technical concepts in a way easily understood by all got the audience involved right away.
Tyron began with an important distinction between Innography and the rest of the patent software companies: it was founded from a genuine desire to improve the patent process and help customers find new, better ways to do their work. Tyron shared how, as an inventor at IBM, he discovered several flaws in the patent process. He saw a need for improvement in the patent experience, and had ideas about how to do it. And that’s how Innography was born. The main part of his talk touched upon three quality challenges that are still relevant to Innography today: Big Data, Customer Support, and Algorithms and Visualizations.
Quality Challenge: Big Data
To put IP data in the context of “big data,” Tyron presented some figures for the data that Innography stores, analyzes, and derives insights from.
Here are a few of them:
- 225,000,000: PAIR image file wrapper documents
- 57,000,000: Full-text, translated patent documents
- 18,500,000: Non-patent literature documents
- 16,000,000: Unique, normalized inventor names
- 10,000,000: Company names and variations with normalizations and hierarchies
- 23,000: Pharmaceutical documents
As a platform designed for insight, Innography’s primary job is turning data from over 200 sources into usable, verified, up-to-date quality data. To do that, a weekly ETL (extract, transform, load) process downloads, cleans, normalizes, and correlates patent and related data to make it usable. Only then are Innography’s tools and software able to provide real insight.
Quality Challenge: Customer Support
One of the critical factors in the success of Innography has been understanding what customers need or could need, and then creating services, products, solutions, and software around that need. Along with the due diligence to find out what customers need, it is equally important to follow up to determine whether the product offerings do, in fact, provide the expected benefits to the customer, to what extent, and what could be improved.
The slides that followed outlined what Innography’s patent analytics software offers to help our customers achieve their goals: data cleansing, correlated data sources, semantic search, proprietary analytics algorithms, over 70 exportable visualizations, automated reports, and availability on both mobile and desktop, which no other major patent analytics tool offers. Use cases include areas such as Research and Innovation, Licensing, Competitive Analysis, Litigation, Acquisition, Risk Management, and Strategy.
The success of Innography’s customer-focused approach was evident in Tyron’s slides showing the many awards won by Innography and its Net Promoter Score (NPS, from a survey conducted in 2015) as compared with other information providers and B2B software. NPS is a customer loyalty metric developed by (and a registered trademark of) Fred Reichheld, Bain & Company, and Satmetrix. NPS ranges from −100, where everyone surveyed is a detractor, to 100, where everyone surveyed is a promoter. A positive NPS is considered good, and an NPS of 50 is excellent. TSIA also recently recognized the Innography Client Success team for their efforts in the realm of customer support.
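For readers unfamiliar with how the score is calculated, here is a minimal sketch in Python. It assumes the standard NPS convention, which was not spelled out in the talk: respondents rate on a 0 to 10 scale, ratings of 9 or 10 count as promoters, ratings of 0 through 6 count as detractors, and the score is the percentage of promoters minus the percentage of detractors. The function name and the sample ratings are made up for illustration and are not Innography's survey data.

```python
def net_promoter_score(ratings):
    """Return the NPS for a list of 0-10 survey ratings (standard convention)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)    # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)   # ratings of 0 through 6
    # Passives (7 or 8) count toward the total but toward neither group.
    return 100.0 * (promoters - detractors) / len(ratings)


# Made-up ratings for illustration only.
sample = [10, 9, 9, 8, 7, 6, 10, 3, 9, 10]
print(net_promoter_score(sample))
```

With six promoters and two detractors out of ten responses, the sample lands at 40, which falls in the "good" range described above.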
Quality Challenge: Algorithms/Visualizations
The final challenge is developing and maintaining the quality of algorithms and visualizations. Tyron explained this as the challenge of looking at customer needs together with internal knowledge to 1) develop algorithms that work and 2) deliver visualizations that provide necessary and new insights. The slides highlighted 20 algorithms and visualizations, including PatentStrength™ and CustomStrength™, Text Clustering, 4D Visuals, Inventor Resume Analytics, Financial Correlation, and Company Comparison. He also highlighted some of Innography’s core technical competencies, such as 1) use of Amazon’s cloud products for computing, analytics, and auto-scaling, 2) Agile methodology for rapid software development, 3) a full-stack development team with management averaging 20 years of experience, and 4) machine learning and entity resolution expertise.
For the rest of the presentation, I walked through a couple of normalization and matching problems that Innography has solved for company and inventor names, and then concluded with some best practices for ensuring quality data in the product.
A couple of cartoons (courtesy of Keefe and Wasserman) were shown to illustrate what happens when a product, no matter how revolutionary, fails to deliver on its promise. Customers are not forgiving and neither are companies and management.
At the core of Innography are 1) the correlation of disparate data (patents, litigation, trademarks, companies, and people such as inventors) and 2) the presentation of that data so that clients can answer complex questions, such as “What IP does company X own?”, “Who are the competitors?”, “What are the IP or litigation trends in a certain technology area?” and more.
Company Name Normalization
One of the critical algorithms that helps tie all the data together is our “Company Name Normalization.” As an example, let us walk through a hypothetical scenario of name normalization. The bucket of company names received from patents, trademarks, and litigation looks as illustrated in the slide below. There are misspellings, names with locations appended to them, and abbreviations or acronyms.
The goal is to identify the unique companies in the list. First, we create clusters of similar names. In the above example, all company names with the word “General” would be grouped into one cluster, and the same would happen for “Microsoft” and “AT&T,” including any misspellings. Then, using patent technology areas, patent families, inventors, and timelines, our context-based algorithm would cluster all the different variations under one name: GE is recognized as General Electric, while ATT Advanced Thermal Technologies is removed from the AT&T umbrella. It has taken several iterations of tweaking and learning to reduce errors in company name clustering and matching and to achieve one of the highest-quality company name normalization capabilities in the industry.
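To make the two stages concrete, here is a minimal Python sketch. It is not Innography's algorithm. It assumes a simple string-similarity test from the standard library's difflib for the name stage and a Jaccard overlap of technology areas for the context stage. The suffix list, thresholds, field names, and sample records are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "company", "ltd", "llc"}


def clean(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def acronym(name):
    """First letters of the cleaned name, e.g. 'General Electric' -> 'ge'."""
    return "".join(token[0] for token in clean(name).split())


def names_look_related(a, b, threshold=0.85):
    """Stage 1: cheap string test that tolerates misspellings and acronyms."""
    ca, cb = clean(a), clean(b)
    if not ca or not cb:
        return False
    return (
        SequenceMatcher(None, ca, cb).ratio() >= threshold
        or ca == acronym(b)
        or cb == acronym(a)
        or ca.split()[0] == cb.split()[0]  # shared leading word
    )


def context_overlap(a, b):
    """Stage 2: Jaccard overlap of the technology areas seen with each name."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_companies(records, min_overlap=0.3):
    """Greedy clustering: a record joins the first cluster whose seed record
    matches on both name and context; otherwise it starts a new cluster."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            seed = cluster[0]
            if (names_look_related(rec["name"], seed["name"])
                    and context_overlap(rec["areas"], seed["areas"]) >= min_overlap):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


# Made-up records; the technology areas are illustrative only.
records = [
    {"name": "General Electric Company", "areas": {"turbines", "medical imaging"}},
    {"name": "GE", "areas": {"turbines", "lighting"}},
    {"name": "AT&T Inc.", "areas": {"telecommunications", "networking"}},
    {"name": "ATT Advanced Thermal Technologies", "areas": {"thermal management"}},
    {"name": "Microsoft Corporation", "areas": {"operating systems", "cloud"}},
    {"name": "Microsfot Corp.", "areas": {"operating systems"}},
]

for cluster in cluster_companies(records):
    print([rec["name"] for rec in cluster])
```

In this toy data, "GE" joins General Electric through the acronym test plus a shared technology area, the misspelled "Microsfot" joins Microsoft, and ATT Advanced Thermal Technologies stays separate from AT&T because the names look related but the contexts do not overlap. A production system would draw on richer context (patent families, inventors, timelines) and would compare against more than a single seed record.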
Inventor Name Normalization
A more complex normalization methodology is used when we cluster and identify inventors of intellectual property. Unlike company names, which are typically unique, a person’s name often maps to several inventors. Think of how many John Smiths there are in the world. Bringing together other context (the inventor’s location, the company they worked for, the publication date of the invention, and co-inventors) helps to identify individual inventors. A good example is a well-recognized name and inventor, Bill Gates, or William H. Gates. There were around a dozen inventors named William H. Gates in our database, and without the automated inventor name normalization and clustering, it would be a time-intensive, manual effort to separate the inventors with the same names.
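As a rough illustration of how context can separate inventors who share a name, here is a small Python sketch. It is an assumption-laden toy rather than the production methodology: it counts how many of the four context signals mentioned above two records share, and treats records with the same name and at least two shared signals as the same person. The field names, the five-year window, the threshold, and the records themselves are all hypothetical.

```python
def context_score(a, b, year_window=5):
    """Count how many context signals two same-name records share."""
    return sum([
        a["city"] == b["city"],                      # same location
        a["company"] == b["company"],                # same employer
        abs(a["year"] - b["year"]) <= year_window,   # close publication dates
        bool(a["coinventors"] & b["coinventors"]),   # at least one shared co-inventor
    ])


def group_inventors(records, min_score=2):
    """Greedy grouping: a record joins the first person who already has a
    record with the same name and enough shared context."""
    people = []
    for rec in records:
        for person in people:
            if any(rec["name"] == other["name"]
                   and context_score(rec, other) >= min_score
                   for other in person):
                person.append(rec)
                break
        else:
            people.append([rec])
    return people


# Made-up records for illustration only.
records = [
    {"id": 1, "name": "john smith", "city": "Austin", "company": "Acme",
     "year": 2005, "coinventors": {"a. jones"}},
    {"id": 2, "name": "john smith", "city": "Austin", "company": "Acme",
     "year": 2008, "coinventors": {"b. lee"}},
    {"id": 3, "name": "john smith", "city": "Boston", "company": "Initech",
     "year": 2007, "coinventors": {"c. wu"}},
    {"id": 4, "name": "john smith", "city": "Denver", "company": "Globex",
     "year": 2010, "coinventors": {"b. lee"}},
]

for person in group_inventors(records):
    print([rec["id"] for rec in person])
```

Records 1, 2, and 4 end up as one person (record 4 moved cities and employers but kept a co-inventor and a nearby date), while record 3 matches only on date and is kept as a separate John Smith.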
Data: Best Practices
With so many data sources, implementing best practices is critical to ensuring quality. The presentation concluded with a brief overview of some of these practices we believe in and strive to follow:
- Evaluate the data and only use it if it passes a checklist. Bad or incorrect data, once in your database, is hard to remove.
- Continuously improve data, technology, algorithms, and processes. Don’t get complacent; constantly seek to improve.
- Automate where possible. Constantly seek to be efficient.
- Listen to your customers. There is a lot to learn from customers so be proactive and create a captive user base.
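The first practice in the list, admitting data only if it passes a checklist, could be sketched in Python as below. The three checks, the field names, and the sample records are hypothetical stand-ins chosen for illustration; the actual checklist was not shown in the presentation.

```python
from datetime import date

# Hypothetical checklist: each entry is a label plus a test on one record.
CHECKLIST = [
    ("has a publication number",
     lambda r: bool(r.get("publication_number"))),
    ("has at least one assignee",
     lambda r: len(r.get("assignees", [])) > 0),
    ("filing date is a real date that is not in the future",
     lambda r: isinstance(r.get("filing_date"), date)
     and r["filing_date"] <= date.today()),
]


def failed_checks(record):
    """Return the labels of every checklist item the record fails."""
    return [label for label, check in CHECKLIST if not check(record)]


def screen(records):
    """Split records into those safe to load and those held back, with reasons."""
    accepted, rejected = [], []
    for record in records:
        failures = failed_checks(record)
        if failures:
            rejected.append((record, failures))
        else:
            accepted.append(record)
    return accepted, rejected


# Made-up records for illustration only.
incoming = [
    {"publication_number": "X-0001", "assignees": ["Acme"],
     "filing_date": date(2015, 1, 1)},
    {"publication_number": "", "assignees": [],
     "filing_date": date(2015, 1, 1)},
]

accepted, rejected = screen(incoming)
print(len(accepted), "accepted;", len(rejected), "rejected")
for record, failures in rejected:
    print("rejected because:", failures)
```

Keeping the reasons alongside each rejected record makes it easier to fix the source rather than clean up the database afterwards, which is the point of screening data before it gets in.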
Tyron and I enjoyed being part of the Austin section’s American Society for Quality event, and getting a chance to speak about Innography through the lens of quality because it is essentially what drives us — what we do, why we do it, and how we do it.