🛢 pipelines and transformers to turn startup domains into huge dataframes filled with training data

startupradar/transformers

Transformers turns API data into ML-ready dataframes

StartupRadar Transformers

This Python package lets you integrate the StartupRadar API directly into your own data or machine learning pipelines. Starting from nothing more than a list of domains, you can create large Pandas DataFrames filled with the data available on StartupRadar.

Implemented transformers

startupradar.transformers.export

  • Creates a human-readable DataFrame for use in Excel or Google Sheets (via CSV).

startupradar.transformers.core

Transformers in this module create features from data returned by the StartupRadar API. All transformers in this module require API access.

  • LinkTransformer: Create columns for all the domains a given domain links to
  • BacklinkTransformer: Create columns for all the domains that link to the given domain
  • DomainTextTransformer: Create a text column with the homepage text of the given domain
  • BacklinkTypeCounter: Counts the types of pages that link to a specific domain

startupradar.transformers.basic

Transformers in this module work on DataFrames and generate features from the domains themselves. They don't require the API and can be used by anyone.

  • DomainNameTransformer: Extract features from a domain name; currently only the top-level domain, e.g. com or io
  • CommonStringTransformer: Applies a CountVectorizer to find common strings among the passed inputs
  • ColumnPrefixTransformer: Create a DataFrame with the same column names, but prefixed with e.g. prefix_
  • CounterTransformer: Create row-wise Counter objects and distribute keys as columns

startupradar.transformers.pandas

Re-implementations of scikit-learn transformers that output Pandas DataFrames instead of arrays. These transformers can be used by anyone; no API key is necessary.

  • OneHotEncoderDF: One Hot Encoder outputting a dense and consistent DataFrame
  • FeatureUnionDF: Create a FeatureUnion with pd.DataFrames as input and output
  • PipelineDF: Creates a pipeline that retains DataFrames and their column names
  • TfidfVectorizerDF: Adaptation of the scikit-learn TfidfVectorizer
  • CountVectorizerDF: Adaptation of the scikit-learn CountVectorizer
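
The general idea behind DataFrame-preserving unions can be sketched with plain Pandas. The following is only an illustration under stated assumptions, not how FeatureUnionDF is implemented: `union_df` is a hypothetical helper, and the two feature functions are made up for the example. Each function receives the same DataFrame, its output columns get a prefix, and the parts are joined column-wise so the column names survive.

```python
import pandas as pd

def union_df(df, transforms):
    """Apply each (prefix, function) pair to df and join the results column-wise."""
    parts = [func(df).add_prefix(f"{prefix}_") for prefix, func in transforms]
    # axis=1 aligns on the index, so every part must keep the input's index.
    return pd.concat(parts, axis=1)

df = pd.DataFrame({"domain": ["startupradar.co", "karllorey.com"]})
features = union_df(df, [
    # Hypothetical feature: length of the domain name.
    ("len", lambda d: d["domain"].str.len().to_frame()),
    # Hypothetical feature: everything after the last dot.
    ("tld", lambda d: d["domain"].str.rsplit(".", n=1).str[-1].to_frame()),
])
print(list(features.columns))  # ['len_domain', 'tld_domain']
```

Prefixing each part avoids column name collisions when several transformers derive features from the same input column.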

Upcoming

Transformers we're considering and may add soon:

  • a transformer that leverages the similar-domains endpoint
  • TF-IDF over all backlinks or (forward) links combined (domain- or URL-level)

How it works

For most transformers, you can simply pass a Pandas Series of domain names as input. With the DomainNameTransformer, for example, it looks like this:

> import pandas as pd
> from startupradar.transformers.basic import DomainNameTransformer
>
> domains = ["loreyventures.com", "startupradar.co", "karllorey.com"]
> domains_series = pd.Series(domains)
> t = DomainNameTransformer()
> t.fit_transform(domains_series)
                   tld
loreyventures.com  com
startupradar.co     co
karllorey.com      com
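
If you want to see roughly what this output contains without installing the package, the following sketch approximates it with plain Pandas. It is an assumption-laden stand-in, not the DomainNameTransformer itself: `extract_tld` is a hypothetical function, and it naively takes everything after the last dot.

```python
import pandas as pd

def extract_tld(domains: pd.Series) -> pd.DataFrame:
    """Return a DataFrame indexed by domain with a single 'tld' column."""
    # Take everything after the last dot; "example.co.uk" would yield "uk".
    tld = domains.str.rsplit(".", n=1).str[-1]
    return pd.DataFrame({"tld": tld.to_numpy()}, index=domains.to_numpy())

domains = pd.Series(["loreyventures.com", "startupradar.co", "karllorey.com"])
result = extract_tld(domains)
print(result)
```

Multi-part suffixes such as co.uk would need a public suffix list to be handled properly, which this sketch deliberately leaves out.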

Install

This is a work in progress, and you should expect things to change on a daily basis. If you still want to try it, you can install the latest version as a git-based dependency:

> pip install git+https://github.com/startupradar/transformers.git
