Provide inline URL preview

ulope commented 8 years ago

Please provide an inline link preview similar to the image preview. This is very useful in Slack for example.

timabbott commented 8 years ago

Is what you have in mind a generic "headless browser takes a screenshot of the website" thing, not an integration specific to the site in question?

ulope commented 8 years ago

No, they're using some form of automatic content extraction.

Slack's documentation on that feature

Some examples:

example1 example2

In the documentation it reads as if they only support pages that contain oEmbed or OpenGraph tags but that is definitely not the case.

For Zulip I'd think that using the readability algorithm should provide pretty good results.

timabbott commented 8 years ago

OK cool. Yeah, this would be a great feature to add.

lfaraone commented 8 years ago

We previously used embed.ly for this, but it was removed. I would be adverse to re-adding because it was:

closed source
slow (if done server-side) or messed with rendering (if done via their client-side JS)

But if there's some library for content extraction that has reasonable behaviour, we could integrate it.

timabbott commented 7 years ago

This article has pretty good information on what Slack's algorithm is for deciding what to preview, and https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254#.v33uhgybz

This blog post is also useful for understanding what people are recommending sites do: https://blog.kissmetrics.com/open-graph-meta-tags/

I think step 1 on this front would be to just add code to the Zulip link processor to (e.g.) just fetch open graph data and display it nicely, and then add the other sources as we go. The fetch_open_graph_image method in bugdown may have most of the logic for doing this already; we just currently only call that method for links to Dropbox (and the dropbox stuff is old code and may not work).

So we can do this, potentially fairly easily, by tweaking the InlineInterestingLinkProcessor class in the Zulip markdown processor, and in particular the run method of that class. You can see that we already do something nice for (1) twitter messages (assuming it's configured properly) and (2) youtube links and have code for (3) dropbox, though that code I think isn't active right now.

One can test in a development environment by sending messages, and then add tests to zerver.tests.test_bugdown.

gearheart commented 7 years ago

We've implemented oembed and other ways of data retrieval, now we need to run this thing asynchronously.

1

One way to implement an async solution is following:

introduce renderer param which will specify if urls should be processed
when a new message is received - render it without processing urls and launch async worker task
in async worker task, run the same renderer, but enable url processing this time. which will basically work as in our first version, but asynchronously
when re-rendered - notify clients
later - introduce caching

We are a little stuck on first point. Is there a way to pass a parameter to renderer for specific render call? There are some global parameters and domains, but I’m not sure if that’s what we need.

2

Another approach is to make renderer always process urls and initiate async worker when faced with uncached url. The worker will populate the cache and will initiate re-render that will now get url data from the cache instead of triggering workers. I think I like this approach more, but we need to know - where to store urls cache?

Historical data and other system calls

One thing that bothers me with either approach is the fact that rendering might be triggered from different parts of the project. For example re-rendering historical data. I’m not sure that it’s a good idea to process urls in such situations. What do you think?

Design

Also, we need designs :)

timabbott commented 7 years ago

The design in 2 makes sense to me. python-markdown doesn't have great support for passing parameters to the renderer; the hack we've been using to manage that is the db_data and current_message globals; I would probably add a third global for this purpose.

Message rendering is already done by an async worker process (zerver/worker/queue_processors.py, search for "message_sender"), so I think we can have the async worker just do 2 rendering passes. One detail: for messages sent by bots (user_profile.is_bot), I think we only want 1 rendering pass (i.e. only display the message once it's fully rendered).

As a sidenote, see https://github.com/zulip/zulip/issues/2004 for basically the rendering issue in our existing Twitter cards support.

For caching, the fetch_tweet_data method uses the @cache_with_key decorator to cache results in the third_party_api_results table; that cache isn't quite designed properly (in that it doesn't namespace the IDs as e.g. "twitter:") but we should be able to use that caching mechanism for the URL fetches. The tweet cache is for precisely this purpose: we want to be able to rerender historical messages without having to pay the perf penalty (and risk of changing the content) of the relevant site.

Obviously we want to be smart and only do 1 rendering pass in the event that we have the data needed to render the message in cache.

For designs, we're generally going for something similar to our twitter cards or inline image/youtube previews in size -- I'm not a fan of how giant image previews are on sites like Facebook Messenger or Slack (it destroys the ability to have a conversation). But their open graph previews, which seem to be 3 lines of text in size, are pretty reasonable. If you put something in that sort of standard card layout, I can just play with optimizing the design piece when we merge it, but I'll ping our designer about doing a design for this as well.

gearheart commented 7 years ago

Hi @timabbott

We have a sync version working (https://github.com/zulip/zulip/compare/master...TigorC:url_embed), but have stuck a little bit on the async version. You suggested to do loading in the same worker and avoid unnecessary parsing.

As far as I understand, do_send_messages does rendering and saving of messages. We need to add queries and re-render there.

There are lines:

    for message in messages:
        #...
        rendered_content = render_incoming_message(message, …)

We parse and can detect if there are any links that require external queries inside of rendering function. We need to pass this information from inside of rendering engine to the do_send_messages level, so that we could make queries and trigger re-render if needed.

I couldn’t find any clean ways to do it, but I’ve noticed that some rendering functions assign attributes to message object. It’s somewhat hacky, but we can dothat as well: rendering function will try to get urls from the cache and if it’s not there - it will add it to a list of unrendered urls in message's attribute. do_send_messages will later go through this list and make queries and re-render.

What do you think about it?

Also, you suggested to use do_update_message to notify the client of updates, but it modifies edit history, updates message version, etc. Should we do all of this?

Thanks

timabbott commented 7 years ago

Re: do_update_message, no, we don't want to modify edit history; I think we'll want to write a new function that uses the same logic to update caches and call send_event as do_update_message, but creates it own slightly event (maybe slightly tweaked to add a flag so the frontend knows this is an "add_preview" style update event, or whatever). Via server_events.js, this ends up resulting in update_messages in message_store.js being called, and we'll need to add some conditionals or something in that code path to still do the rerendering, but not tag the message as edited (etc.).

Returning a value from bugdown via dropping data onto the message object is our current approach for doing that flow, so feel free to use that same approach for this.

timabbott commented 7 years ago

This was implemented in c93f1d4eda71ddcca5ba05d0bf9fbfc8895829e2!

It needs a bit more work and manual testing before I'm willing to turn it on by default, which is discussed in the #2133 discussion. We can close this issue as resolved once the feature is ready to be enabled by default.

timabbott commented 7 years ago

Closing this; #4645 has follow-up issues.

zulip / zulip

Provide inline URL preview #406

1

2

Historical data and other system calls

Design