Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Commit

Permalink
Use attrs internally for the URL preview code & add documentation. (#…
Browse files Browse the repository at this point in the history
  • Loading branch information
clokep authored Sep 7, 2021
1 parent a23f3ab commit 89ba834
Show file tree
Hide file tree
Showing 5 changed files with 132 additions and 119 deletions.
1 change: 1 addition & 0 deletions changelog.d/10753.misc
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Use `attrs` internally for the URL preview code & update documentation.
2 changes: 1 addition & 1 deletion docs/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@
- [Application Services](application_services.md)
- [Server Notices](server_notices.md)
- [Consent Tracking](consent_tracking.md)
- [URL Previews](url_previews.md)
- [URL Previews](development/url_previews.md)
- [User Directory](user_directory.md)
- [Message Retention Policies](message_retention_policies.md)
- [Pluggable Modules](modules.md)
Expand Down
51 changes: 51 additions & 0 deletions docs/development/url_previews.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
URL Previews
============

The `GET /_matrix/media/r0/preview_url` endpoint provides a generic preview API
for URLs which outputs [Open Graph](https://ogp.me/) responses (with some Matrix
specific additions).

This does have trade-offs compared to other designs:

* Pros:
* Simple and flexible; can be used by any clients at any point
* Cons:
* If each homeserver provides one of these independently, all the HSes in a
room may needlessly DoS the target URI
* The URL metadata must be stored somewhere, rather than just using Matrix
itself to store the media.
* Matrix cannot be used to distribute the metadata between homeservers.

When Synapse is asked to preview a URL it does the following:

1. Checks against a URL blacklist (defined as `url_preview_url_blacklist` in the
config).
2. Checks the in-memory cache by URLs and returns the result if it exists. (This
is also used to de-duplicate processing of multiple in-flight requests at once.)
3. Kicks off a background process to generate a preview:
1. Checks the database cache by URL and timestamp and returns the result if it
has not expired and was successful (a 2xx return code).
2. Checks if the URL matches an oEmbed pattern. If it does, fetch the oEmbed
response. If this is an image, replace the URL to fetch and continue. If
if it is HTML content, use the HTML as the document and continue.
3. If it doesn't match an oEmbed pattern, downloads the URL and stores it
into a file via the media storage provider and saves the local media
metadata.
5. If the media is an image:
1. Generates thumbnails.
2. Generates an Open Graph response based on image properties.
6. If the media is HTML:
1. Decodes the HTML via the stored file.
2. Generates an Open Graph response from the HTML.
3. If an image exists in the Open Graph response:
1. Downloads the URL and stores it into a file via the media storage
provider and saves the local media metadata.
2. Generates thumbnails.
3. Updates the Open Graph response based on image properties.
7. Stores the result in the database cache.
4. Returns the result.

The in-memory cache expires after 1 hour.

Expired entries in the database cache (and their associated media files) are
deleted every 10 seconds. The default expiration time is 1 hour from download.
76 changes: 0 additions & 76 deletions docs/url_previews.md

This file was deleted.

121 changes: 79 additions & 42 deletions synapse/rest/media/v1/preview_url_resource.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,9 +22,11 @@
import shutil
import sys
import traceback
from typing import TYPE_CHECKING, Any, Dict, Generator, Iterable, Optional, Union
from typing import TYPE_CHECKING, Dict, Generator, Iterable, Optional, Union
from urllib import parse as urlparse

import attr

from twisted.internet.error import DNSLookupError
from twisted.web.server import Request

Expand All @@ -42,6 +44,7 @@
from synapse.rest.media.v1._base import get_filename_from_headers
from synapse.rest.media.v1.media_storage import MediaStorage
from synapse.rest.media.v1.oembed import OEmbedError, OEmbedProvider
from synapse.types import JsonDict
from synapse.util import json_encoder
from synapse.util.async_helpers import ObservableDeferred
from synapse.util.caches.expiringcache import ExpiringCache
Expand Down Expand Up @@ -71,7 +74,43 @@
ONE_HOUR = 60 * 60 * 1000


@attr.s(slots=True, frozen=True, auto_attribs=True)
class MediaInfo:
"""
Information parsed from downloading media being previewed.
"""

# The Content-Type header of the response.
media_type: str
# The length (in bytes) of the downloaded media.
media_length: int
# The media filename, according to the server. This is parsed from the
# returned headers, if possible.
download_name: Optional[str]
# The time of the preview.
created_ts_ms: int
# Information from the media storage provider about where the file is stored
# on disk.
filesystem_id: str
filename: str
# The URI being previewed.
uri: str
# The HTTP response code.
response_code: int
# The timestamp (in milliseconds) of when this preview expires.
expires: int
# The ETag header of the response.
etag: Optional[str]


class PreviewUrlResource(DirectServeJsonResource):
"""
Generating URL previews is a complicated task which many potential pitfalls.
See docs/development/url_previews.md for discussion of the design and
algorithm followed in this module.
"""

isLeaf = True

def __init__(
Expand Down Expand Up @@ -219,18 +258,17 @@ async def _do_preview(self, url: str, user: str, ts: int) -> bytes:

logger.debug("got media_info of '%s'", media_info)

if _is_media(media_info["media_type"]):
file_id = media_info["filesystem_id"]
if _is_media(media_info.media_type):
file_id = media_info.filesystem_id
dims = await self.media_repo._generate_thumbnails(
None, file_id, file_id, media_info["media_type"], url_cache=True
None, file_id, file_id, media_info.media_type, url_cache=True
)

og = {
"og:description": media_info["download_name"],
"og:image": "mxc://%s/%s"
% (self.server_name, media_info["filesystem_id"]),
"og:image:type": media_info["media_type"],
"matrix:image:size": media_info["media_length"],
"og:description": media_info.download_name,
"og:image": f"mxc://{self.server_name}/{media_info.filesystem_id}",
"og:image:type": media_info.media_type,
"matrix:image:size": media_info.media_length,
}

if dims:
Expand All @@ -240,42 +278,41 @@ async def _do_preview(self, url: str, user: str, ts: int) -> bytes:
logger.warning("Couldn't get dims for %s" % url)

# define our OG response for this media
elif _is_html(media_info["media_type"]):
elif _is_html(media_info.media_type):
# TODO: somehow stop a big HTML tree from exploding synapse's RAM

with open(media_info["filename"], "rb") as file:
with open(media_info.filename, "rb") as file:
body = file.read()

encoding = get_html_media_encoding(body, media_info["media_type"])
og = decode_and_calc_og(body, media_info["uri"], encoding)
encoding = get_html_media_encoding(body, media_info.media_type)
og = decode_and_calc_og(body, media_info.uri, encoding)

# pre-cache the image for posterity
# FIXME: it might be cleaner to use the same flow as the main /preview_url
# request itself and benefit from the same caching etc. But for now we
# just rely on the caching on the master request to speed things up.
if "og:image" in og and og["og:image"]:
image_info = await self._download_url(
_rebase_url(og["og:image"], media_info["uri"]), user
_rebase_url(og["og:image"], media_info.uri), user
)

if _is_media(image_info["media_type"]):
if _is_media(image_info.media_type):
# TODO: make sure we don't choke on white-on-transparent images
file_id = image_info["filesystem_id"]
file_id = image_info.filesystem_id
dims = await self.media_repo._generate_thumbnails(
None, file_id, file_id, image_info["media_type"], url_cache=True
None, file_id, file_id, image_info.media_type, url_cache=True
)
if dims:
og["og:image:width"] = dims["width"]
og["og:image:height"] = dims["height"]
else:
logger.warning("Couldn't get dims for %s", og["og:image"])

og["og:image"] = "mxc://%s/%s" % (
self.server_name,
image_info["filesystem_id"],
)
og["og:image:type"] = image_info["media_type"]
og["matrix:image:size"] = image_info["media_length"]
og[
"og:image"
] = f"mxc://{self.server_name}/{image_info.filesystem_id}"
og["og:image:type"] = image_info.media_type
og["matrix:image:size"] = image_info.media_length
else:
del og["og:image"]
else:
Expand All @@ -301,17 +338,17 @@ async def _do_preview(self, url: str, user: str, ts: int) -> bytes:
# store OG in history-aware DB cache
await self.store.store_url_cache(
url,
media_info["response_code"],
media_info["etag"],
media_info["expires"] + media_info["created_ts"],
media_info.response_code,
media_info.etag,
media_info.expires + media_info.created_ts_ms,
jsonog,
media_info["filesystem_id"],
media_info["created_ts"],
media_info.filesystem_id,
media_info.created_ts_ms,
)

return jsonog.encode("utf8")

async def _download_url(self, url: str, user: str) -> Dict[str, Any]:
async def _download_url(self, url: str, user: str) -> MediaInfo:
# TODO: we should probably honour robots.txt... except in practice
# we're most likely being explicitly triggered by a human rather than a
# bot, so are we really a robot?
Expand Down Expand Up @@ -423,18 +460,18 @@ async def _download_url(self, url: str, user: str) -> Dict[str, Any]:
# therefore not expire it.
raise

return {
"media_type": media_type,
"media_length": length,
"download_name": download_name,
"created_ts": time_now_ms,
"filesystem_id": file_id,
"filename": fname,
"uri": uri,
"response_code": code,
"expires": expires,
"etag": etag,
}
return MediaInfo(
media_type=media_type,
media_length=length,
download_name=download_name,
created_ts_ms=time_now_ms,
filesystem_id=file_id,
filename=fname,
uri=uri,
response_code=code,
expires=expires,
etag=etag,
)

def _start_expire_url_cache_data(self):
return run_as_background_process(
Expand Down Expand Up @@ -580,7 +617,7 @@ def get_html_media_encoding(body: bytes, content_type: str) -> str:

def decode_and_calc_og(
body: bytes, media_uri: str, request_encoding: Optional[str] = None
) -> Dict[str, Optional[str]]:
) -> JsonDict:
"""
Calculate metadata for an HTML document.
Expand Down

0 comments on commit 89ba834

Please sign in to comment.