This project implements an API to handle URL submissions, track download successes and failures, and expose endpoints that return the top URLs by submission count or the latest N submitted. URLs and their associated data are stored in an in-memory map paired with a doubly linked list, giving fast lookups for updates while maintaining order so the latest URLs can be queried quickly.
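As a rough illustration of this layout (the type and field names below are hypothetical, not taken from the codebase), the store can be pictured as a map keyed by URL whose values are nodes in a doubly linked list:

```go
package store

import (
	"sync"
	"time"
)

// node holds per-URL stats and sits in a doubly linked list
// ordered by last submission time. Names are illustrative only.
type node struct {
	url           string
	count         int
	successes     int
	failures      int
	lastSubmitted time.Time
	prev, next    *node
}

// store pairs an index map (O(1) lookup by URL) with the list
// (cheap reordering and "latest N" traversal).
type store struct {
	mu    sync.Mutex
	byURL map[string]*node
	head  *node // most recently submitted
	tail  *node // oldest entry
}
```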
- Submit URL: Accepts a submitted URL and tracks its download success/failure.
- Top URLs: Fetches the top N URLs, sorted by submission count or by latest submission time.
- Filtering: Supports sorting and limiting the number of URLs returned.
- Endpoint: `/submit-url`
- Method: `POST`
- Description: Accepts a URL and records its submission, tracking the number of successes and failures. If the URL has been successfully fetched in a previous request, its count, success/failure totals, and last-submitted time are updated. If this is the first time that specific URL has been submitted and the GET request to fetch it fails, it will not be stored.
- Request Body (JSON):
  `{ "url": "http://example.com" }`
- Response:
  `{ "message": "url submitted" }`
- Example:
  `curl -X POST -H "Content-Type: application/json" -d '{"url": "http://example.com"}' http://host/submit-url`
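A minimal sketch of this flow is shown below. The handler name, `submitRequest` type, and `urlStore` interface are assumptions made for illustration, not the project's actual code:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// submitRequest mirrors the JSON body accepted by /submit-url.
type submitRequest struct {
	URL string `json:"url"`
}

// urlStore is a hypothetical view of the store used by the handler.
type urlStore interface {
	// Record updates count, success/failure totals, and last-submitted
	// time; a first-time URL whose fetch failed is expected to be dropped.
	Record(url string, success bool)
}

func submitHandler(s urlStore) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req submitRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.URL == "" {
			http.Error(w, "invalid request body", http.StatusBadRequest)
			return
		}

		// Attempt the download so the submission can be classified.
		resp, err := http.Get(req.URL)
		success := err == nil && resp.StatusCode < 400
		if resp != nil {
			resp.Body.Close()
		}

		s.Record(req.URL, success)

		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]string{"message": "url submitted"})
	}
}
```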
- Endpoint: `/top-urls`
- Method: `GET`
- Description: Fetches the top N URLs. The results can be sorted either by the count of accesses (`count`) or by the latest submission time (`latest`).
- Query Parameters:
  - `sort_by`: Sorting criterion. Valid values are `"count"` or `"latest"`.
  - `get_n`: Number of top URLs to return.
- Example Request:
  `curl "http://localhost:8080/top-urls?sort_by=count&get_n=50"`
- Response (JSON):
  `[ { "url": "http://example.com", "count": 50 }, { "url": "http://example2.com", "count": 40 } ]`
- Invalid `sort_by`: Returns `400 Bad Request` if an invalid value is provided for the `sort_by` parameter.
  - Example: `"sort_by": "invalid"`
- Invalid `get_n`: Returns `400 Bad Request` if `get_n` is not a valid integer.
  - Example: `"get_n": "not-a-number"`
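A minimal sketch of how this parameter validation could look (handler name, `lister` interface, and response type are assumptions for illustration):

```go
package main

import (
	"encoding/json"
	"net/http"
	"strconv"
)

// topURL matches the JSON shape returned by /top-urls.
type topURL struct {
	URL   string `json:"url"`
	Count int    `json:"count"`
}

// lister is a hypothetical stand-in for the real store.
type lister interface {
	TopURLs(sortBy string, n int) []topURL
}

func topURLsHandler(s lister) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		sortBy := r.URL.Query().Get("sort_by")
		if sortBy != "count" && sortBy != "latest" {
			http.Error(w, "invalid sort_by parameter", http.StatusBadRequest)
			return
		}

		n, err := strconv.Atoi(r.URL.Query().Get("get_n"))
		if err != nil || n <= 0 {
			http.Error(w, "get_n must be a positive integer", http.StatusBadRequest)
			return
		}

		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(s.TopURLs(sortBy, n))
	}
}
```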
The application includes a batch process that runs periodically to collect and process the top URLs. It fetches the top 50 URLs from the store (by count), re-downloads them, updates their stats in the store, and logs the results. This process helps monitor URL activity and provides insight into the number of successes, failures, and the last download time for the top URLs.
The behavior of the batch process is controlled by the following parameters from `config.yaml`:
- `worker_pool_size`: The number of concurrent workers used to process the URLs.
- `num_of_batch_urls`: The number of top URLs collected and processed in each batch.
- `batch_interval_seconds`: The interval (in seconds) between batch process executions.
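As a rough sketch of how these settings could drive the batch loop (the function, `batchStore` interface, and `fetch` callback are illustrative assumptions, not the project's actual code):

```go
package main

import (
	"log"
	"sync"
	"time"
)

// batchStore is a hypothetical view of the store used by the batch process.
type batchStore interface {
	TopByCount(n int) []string       // assumed: top-n URLs by submission count
	Record(url string, success bool) // assumed: update stats after a re-download
}

// runBatches sketches the periodic batch process: every batch_interval_seconds
// it collects the top num_of_batch_urls URLs, fans them out to worker_pool_size
// workers, re-downloads them, updates the store, and logs the result.
func runBatches(s batchStore, fetch func(url string) bool, workerPoolSize, numOfBatchURLs, intervalSeconds int) {
	ticker := time.NewTicker(time.Duration(intervalSeconds) * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		jobs := make(chan string)
		var wg sync.WaitGroup

		// Start a fixed pool of workers for this batch.
		for i := 0; i < workerPoolSize; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for u := range jobs {
					ok := fetch(u) // re-download the URL
					s.Record(u, ok)
					log.Printf("batch: url=%s success=%v", u, ok)
				}
			}()
		}

		for _, u := range s.TopByCount(numOfBatchURLs) {
			jobs <- u
		}
		close(jobs)
		wg.Wait()
	}
}
```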
URLs are stored in a doubly linked list, indexed by a map, so they can be tracked and updated efficiently. Each URL has associated metadata:
- Count: Total number of submissions for the URL.
- Successes: Number of successful downloads for the URL.
- Failures: Number of failed download attempts.
- Last Submitted: Timestamp of the last submission.
Combined with the map index, the linked list structure allows O(1) updates when a URL is added or modified.
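Building on the hypothetical layout sketched earlier, an update might look like the following (method and field names remain assumptions): the map locates the node directly, and the doubly linked list lets it be moved to the front without scanning.

```go
// record is a sketch of an O(1) update on the hypothetical store type
// defined above: find the node via the map, refresh its metadata, and
// move it to the front of the list (most recently submitted).
func (s *store) record(url string, success bool) {
	s.mu.Lock()
	defer s.mu.Unlock()

	n, ok := s.byURL[url]
	if !ok {
		if !success {
			return // a first-time URL whose fetch failed is not stored
		}
		n = &node{url: url}
		s.byURL[url] = n
	} else {
		// Unlink the node from its current position.
		if n.prev != nil {
			n.prev.next = n.next
		}
		if n.next != nil {
			n.next.prev = n.prev
		}
		if s.tail == n {
			s.tail = n.prev
		}
		if s.head == n {
			s.head = n.next
		}
		n.prev, n.next = nil, nil
	}

	// Update metadata.
	n.count++
	if success {
		n.successes++
	} else {
		n.failures++
	}
	n.lastSubmitted = time.Now()

	// Push to the front of the list.
	n.next = s.head
	if s.head != nil {
		s.head.prev = n
	}
	s.head = n
	if s.tail == nil {
		s.tail = n
	}
}
```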
- Go 1.18+.
- Clone the repository:
  `git clone https://github.com/yourusername/url-submission-api.git`
  `cd url-submission-api`
- Install dependencies:
  `go mod tidy`
- Run the application:
  `go run main.go`

The server will start at `http://localhost:8080`.
The application uses a YAML configuration file, `config.yaml`, to load various settings for the server and downloader. The config file is parsed into a Go struct, and the values are used to configure different aspects of the program's behavior.

The `config.yaml` file contains two main sections:
- server: Configuration for the HTTP server.
  - `port`: The port on which the HTTP server will listen for incoming requests. For example, `":8080"` will start the server on port 8080.
- downloader: Configuration for the downloader's behavior.
  - `worker_pool_size`: The number of concurrent worker goroutines in the downloader's worker pool. This controls how many URLs can be processed concurrently.
  - `num_of_batch_urls`: The number of URLs to process in each background batch.
  - `batch_interval_seconds`: The interval, in seconds, between processing URL batches.
Example `config.yaml`:

server:
  port: ":8080"
downloader:
  worker_pool_size: 3
  num_of_batch_urls: 10
  batch_interval_seconds: 10
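A sketch of how this file could map onto a Go struct is shown below; the struct layout and the use of `gopkg.in/yaml.v3` are assumptions for illustration, not necessarily what the project uses.

```go
package main

import (
	"os"

	"gopkg.in/yaml.v3" // assumed YAML library
)

// Config is a hypothetical sketch of the struct config.yaml is parsed into.
type Config struct {
	Server struct {
		Port string `yaml:"port"`
	} `yaml:"server"`
	Downloader struct {
		WorkerPoolSize       int `yaml:"worker_pool_size"`
		NumOfBatchURLs       int `yaml:"num_of_batch_urls"`
		BatchIntervalSeconds int `yaml:"batch_interval_seconds"`
	} `yaml:"downloader"`
}

// loadConfig reads and parses the YAML file at path.
func loadConfig(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}
```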
- Submit a URL:
  `curl -X POST -H "Content-Type: application/json" -d '{"url": "http://example.com"}' http://localhost:8080/submit-url`
- Get Top URLs by Count:
  `curl "http://localhost:8080/top-urls?sort_by=count&get_n=5"`
- Get Top URLs by Latest Submission:
  `curl "http://localhost:8080/top-urls?sort_by=latest&get_n=5"`
- Shut down workers when there are no tasks and spin them back up when necessary.
- Use test data rather than executing real GET requests.
- Provide two filter functions and make `n` non-configurable so the result slice can be preallocated, avoiding reallocation.
- Potentially shard the store (although benchmarks showed worse results, since sorting did not add significant overhead).
- Decide how much result accuracy matters: sorting/processing could be batched to reduce the overhead of fetching sorted lists.
- Better worker pool and HTTP tests; most of the testing effort so far has gone into the store.