zach111694 / molloyLife

Molloy Life application for student media
MIT License

Processing Molloy Events page #5

Open jschanker opened 9 years ago

jschanker commented 9 years ago

I think it would be a good idea to get the data by processing the HTML from http://www.molloy.edu/molloy-life/molloy-life-news . Zach was telling me that the consensus was that this approach was hack-y, but to my knowledge, search engine bots process data like this all the time, relying on proper markup by web page creators. I don't think it would be too bad to go with this approach. If nothing else, I think it could be a good academic exercise to construct the DOM tree and extract the relevant data. Thoughts?
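To make that concrete, here's a minimal sketch of just the fetching half in Node (the extraction step is the part we'd actually have to design):

var http = require('http');

http.get('http://www.molloy.edu/molloy-life/molloy-life-news', function (res) {
  res.setEncoding('utf8');
  var html = '';
  res.on('data', function (chunk) { html += chunk; });
  res.on('end', function () {
    // The DOM construction / data extraction would happen here;
    // for now, just confirm we got the page.
    console.log('Fetched ' + html.length + ' characters of HTML');
  });
}).on('error', function (err) {
  console.error('Request failed: ' + err.message);
});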

zach111694 commented 9 years ago

Yes, that's probably what we're going to do since there's not much else we can do with it. Also, just FYI, the molloy-life-news articles go to the Articles Page. The Events Page will contain data from this page, https://life.molloy.edu/Events, which shows upcoming events and all their details.

jcdesimp commented 9 years ago

My only issue with that is how tightly coupled the web app would become to the format of that HTML page. If there were some sort of RSS feed or API with a consistent format, that would be a better option. By parsing the HTML directly, we risk breaking functionality every time the source page's DOM structure changes.

zach111694 commented 9 years ago

I agree with that too, but that's the thing: we don't have that. This is one of the few options left to go with. If not this, then what do we do?

jcdesimp commented 9 years ago

@zach111694 this is the one that didn't have an RSS feed correct?

zach111694 commented 9 years ago

Well, this issue is supposed to be about the Articles Page, but yes, the link in the first comment here points to the web page, which doesn't have a feed. The Events Page does have one, though; the issue was just filed under the wrong page.

jcdesimp commented 9 years ago

@zach111694 ok, so then will we be doing the raw HTML parsing or using that RSS feed that we explored a bit?

On a side note you may want to point @jschanker to the Trello boards if you haven't already.

zach111694 commented 9 years ago

Obviously we'll use the RSS feed where it's available, so yes, we're using it for the Events Page. I'm saying that for the Articles Page, which doesn't have one, we do the HTML parsing.

jcdesimp commented 9 years ago

Ah, right, OK, so:

Closing since this seems to address the main issue. Feel free to keep replying if you have any other important things to add.

SantiagoVargas commented 9 years ago

There are other possible approaches that should be discussed in terms of viability/feasibility. Allow me to play devil's advocate in order to provoke thought.

Questions that should be asked/answered:

  1. Do we need to mirror that page exactly? Are we providing a mobile site? A copy? What are we providing?
  2. Can we obtain article content any other way? -- Molloy Media is a public organization that any student can be part of. Its content is student-created and should be governed similarly, so there should be many ways of accessing/updating it.
  3. Can we intercept article content before, during, or after publication? Do we only need to show article content pulled from the Molloy website?

Our solution should depend on answers to the above questions.

jcdesimp commented 9 years ago

@SantiVargas

  1. We're not trying to mirror the page, but we do need the data it contains: the event/article names, descriptions, content, etc. The Events page seems to have an RSS feed we can tap into, but the Articles page doesn't have any API or feed that would let us obtain the data easily. As of now the only option seems to be requesting the page and parsing the HTML to get the data. I'm not ecstatic about this option, but if it's the only way then we'll do what we can to optimize it. Crawling the page on a schedule and caching the results is one way of working toward that.
  2. I agree completely, but for now there doesn't seem to be another way. If you guys know the right people to talk to who could help us come up with a better alternative, that would be awesome.
  3. Also a possibility, but an auto-updating source would still be the best option. For now the primary purpose of this app seems to be aggregating data from other sources into a single, mobile-optimized place, so if that's the path we're taking, we should probably avoid managing any content ourselves.

It's also worth noting that all of this indirectly makes the web app a sort of central endpoint for accessing this data; perhaps in the future it could serve as a REST API for native apps. That way a native app would get its data from something we control directly, so we could keep the output format consistent even if the root sources, the various Molloy sites, change. If one of those sites changes, we just change our implementation server-side but still send the same data format to the mobile apps. Just a thought, not really related to the problem at hand, however.
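Roughly, a sketch of how those pieces could fit together (the route path and interval are arbitrary, and fetchArticles is just a placeholder for whatever scraping/feed logic we settle on):

// Sketch only: refresh article data on a schedule, keep it in memory,
// and expose it through a single JSON endpoint that a native app could hit.
var express = require('express');
var router = express.Router();

var cachedArticles = [];

function refreshArticles() {
  // fetchArticles is a placeholder for the scraping/RSS code discussed above.
  fetchArticles(function (err, articles) {
    if (!err) { cachedArticles = articles; }
  });
}

refreshArticles();                             // warm the data on startup
setInterval(refreshArticles, 15 * 60 * 1000);  // re-crawl every 15 minutes

// The output format stays ours even if the Molloy pages change underneath.
router.get('/api/articles', function (req, res) {
  res.json(cachedArticles);
});

module.exports = router;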

SantiagoVargas commented 9 years ago

OK. If this is the best route for now, then there are plenty of Node modules that can assist us in reading the HTML and converting it to other formats if we want.
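For example, cheerio (https://www.npmjs.com/package/cheerio) gives us a jQuery-like API over the parsed page. A rough sketch; the '.article-headline' selector here is made up and would need to match the real page's markup:

var cheerio = require('cheerio');

// Pull headline text out of fetched HTML. The selector is illustrative only.
function extractHeadlines(html) {
  var $ = cheerio.load(html);
  var headlines = [];
  $('.article-headline').each(function () {
    headlines.push($(this).text().trim());
  });
  return headlines;
}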

jschanker commented 9 years ago

Regarding coupling and the changing HTML DOM structure for the Articles page potentially breaking the App:

Given such a change, we humans would (hopefully) still be able to make sense of the structure by viewing the source. I think it would be an interesting programming exercise to make the app smart enough to do the same.

With the current implementation, the HTML processor is in a separate module, and a separate chooser module decides which processor to use (currently there's only one). The quick and dirty HTML processor I put together, which is tailored to work basically only with the current format, parses the Articles pages' source and saves the data to a JSON file. The Articles page gets its data from this file, so if the format changes, we can simply refrain from running the processing script (or, even better, have it run periodically and automatically to get updates, but have it stop itself when it recognizes that it may not be able to process the data correctly).

The current separation of modules would seem to encapsulate what may change. For example, if we need to change the HTML processor, code for a hypothetical XML processor remains untouched. Alternatively, if we only want to change the logic for deciding which processor should be used, none of the modules performing the format-specific parsing needs to be touched.
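Structurally it's roughly the following (names here are made up for illustration, not the actual ones in the branch):

var fs = require('fs');

var htmlProcessor = {
  // Tailored to the current Articles page markup; this is the only piece
  // that would change if the page's structure changes.
  process: function (source) {
    return [];  // stand-in for the real extraction logic
  }
};

function chooseProcessor(format) {
  // Only one processor exists today; this is where that decision lives,
  // so adding e.g. an XML processor later doesn't touch the callers.
  switch (format) {
    case 'html':
    default:
      return htmlProcessor;
  }
}

function runProcessingScript(source) {
  var articles = chooseProcessor('html').process(source);
  fs.writeFileSync('articles.json', JSON.stringify(articles));
}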

Regarding using an existing node module to do the format conversion:

I think it would be more impressive to a potential employer, and more rewarding, if you all implement this functionality yourselves. It's certainly true that it's often better to use heavily tested code written by experienced software engineers when working in industry, but as a programming project for developing and promoting your skill sets, I would definitely recommend the do-it-yourself approach. The processor I made is quite lousy for a number of reasons; I wanted to leave building a robust processor that would, e.g., fully construct the DOM to you all, if you're interested in doing so.

jcdesimp commented 9 years ago

Reopening since discussion seems to still be going on.

@jschanker Took a look at your branch; I like the idea of having an abstract processor so that we can easily swap out implementations in the future without touching the rest of the code. This should be sufficient for now.

Where you're creating a JSON file for storing the articles, I think we might be able to use a caching system instead. I've used node-cache (https://www.npmjs.com/package/node-cache) in projects before, so it may be a suitable option; generally, if we can avoid going to the filesystem, we should. The cache is of course non-persistent between server restarts, but I don't think that would be an issue.
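Something along these lines (the key name and TTL are arbitrary):

var NodeCache = require('node-cache');
var articleCache = new NodeCache({ stdTTL: 600 });  // entries expire after 10 minutes

var processedArticles = [];  // stand-in for the processor's output

// After processing, instead of writing a JSON file:
articleCache.set('articles', processedArticles);

// When rendering the Articles page:
var articles = articleCache.get('articles');
if (articles === undefined) {
  // Cache miss (expired or the server restarted) -- re-run the processor here.
}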

As for the template, perhaps using EJS to render it server-side would be a better option than the client-side script you have in the articles.ejs template. EJS (Embedded JavaScript) can pretty much be used to pre-render templates in a similar way to PHP.

Here's an example I just threw together; it may not be exactly right, but it should illustrate the basic concept:

<% articles.forEach(function (a) { %>
    <h3><%= a.title %></h3>
    <% if (a.image) { %>
        <img src="<%= a.image %>">
    <% } %>
    <p><%= a.content %></p>
<% }); %>

Anything between <% %> is run as JavaScript, server-side. <%= %> outputs the value of the expression inside it, such as a variable.

The data a template can use is passed as an object to res.render() in our Express router. For example, if we were using the code above as a template we might say

res.render('articles', {articles: arrayOfArticleObjects});

Here the first argument names the template (.ejs) file and the second is the data object.

Let me know what you think.

jschanker commented 9 years ago

@jcdesimp I agree with you on both points.

Also, just to elaborate a little on what I wrote before, the code that I introduced should definitely be cleaned up at some point. For example, in addition to the processResponseString function being heavily reliant on the current format of the Articles pages to work, it also suffers from quite a bit of code duplication. Regarding the processing capability of the script, I have a number of ideas to make it much more adaptive to format changes. For example, if some of the previous content were to remain after a hypothetical structure change, we could have our processing program identify any new markup for headlines, dates, article previews, etc. by locating the HTML elements with the corresponding content.
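To sketch that idea (illustration only, not code from the branch; it leans on cheerio for brevity, though we may well hand-roll the DOM work): given one headline we already extracted under the old format, search the new page for the element whose text matches it and record that element's tag/class as the new headline selector, then reuse the selector for the remaining headlines.

var cheerio = require('cheerio');

// Given HTML in an unknown new format and a headline saved from the old
// format, work out which tag/class now wraps headlines.
function learnHeadlineSelector(html, knownHeadline) {
  var $ = cheerio.load(html);
  var selector = null;
  $('*').each(function () {
    var el = $(this);
    // Only consider leaf elements whose own text is the known headline.
    if (el.children().length === 0 && el.text().trim() === knownHeadline) {
      var cls = el.attr('class');
      selector = cls ? this.name + '.' + cls.split(/\s+/).join('.') : this.name;
    }
  });
  return selector;  // e.g. 'h2.news-title', which we could then apply to the rest of the page
}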