This is a shared repository for data processing scripts, with a focus on innovation-related data. 'Processing' in this context can refer to a number of different operations, including (but not limited to):
- normalisation
- disambiguation and entity reconciliation
- web scraping
- parsing web-scraped data
- transforming and merging different datasets
- standardising datasets
- deduplication
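
As a rough illustration of two of the operations listed above (normalisation and deduplication), here is a minimal Python sketch. The function names and the sample applicant names are hypothetical and are not taken from any script in this repository; real scripts will usually need domain-specific rules.

```python
import re
import unicodedata


def normalise_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove anything that is not a lowercase letter, digit or space.
    cleaned = re.sub(r"[^a-z0-9 ]+", "", unaccented.lower())
    # Collapse runs of whitespace into single spaces.
    return " ".join(cleaned.split())


def deduplicate(names):
    """Keep the first spelling seen for each normalised key, in input order."""
    seen = {}
    for name in names:
        seen.setdefault(normalise_name(name), name)
    return list(seen.values())


# Hypothetical sample data: spelling variants of the same two organisations.
applicants = ["Siemens AG", "SIEMENS A.G.", "siemens  ag", "Nestlé S.A.", "Nestle SA"]
unique_applicants = deduplicate(applicants)
```

Stripping punctuation outright is a deliberately crude choice: it merges "A.G." with "AG", but it would also merge names that differ only by punctuation, so treat it as a starting point rather than a recommended rule.
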
If you'd like to link to some data processing scripts, or upload some, please take a look at our contribution guidelines and open a pull request using the pull request template. Links to external repositories are added below; uploaded scripts get their own folder.
Each folder here contains a repository of data processing scripts (or, more commonly, a link to one plus a description) contributed by a member of the community. Each repository listed here should be documented well enough to tell you how to run it and on what data. If you have problems with code files hosted directly in this repository, please open a GitHub issue, or a pull request if you have fixed the problem and would like to amend the documentation. If you're having trouble with an external repository that is linked by URL, raise an issue in that repository instead.
- USPTO Public Data tools (PatentPublicData)
- BigQuery patent data tools (patents-public-data)
- PatentsView in rOpenSci
- Patfam - estimating patent families across sites
- Gephi (Lens example)
- Alaska: A data pipeline benchmark, with profiling data
- the Allen NLP Guide - general-purpose
- linked-uspto-patent-data (rdf), forward43 (social innovation)