Open jschanker opened 9 years ago
Yes, it's probably what we're going to do, since there's not much else we could do with it. Also, just FYI, the molloy-life-news articles go to the Articles Page. The Events Page will contain data from https://life.molloy.edu/Events, which shows upcoming events and all their details.
My only issue with that is how coupled the web app would become to the format of that HTML page. If there were some sort of RSS feed or app with a consistent format, that would be a better option. By parsing the HTML directly, we risk breaking functionality each time the source page's DOM structure changes.
I agree with that too, but that's the thing: we don't have that. This is one of the few options left to go with. If not this, then what do we do?
@zach111694 this is the one that didn't have an RSS feed, correct?
Well, this issue is supposed to be on the Articles page, but yes, the first comment here links to the web page that doesn't have a feed. The Events page does have one, though; the issue was just posted on the wrong page.
@zach111694 ok, so then will we be doing the raw HTML parsing or using that RSS feed that we explored a bit?
On a side note you may want to point @jschanker to the Trello boards if you haven't already.
Obviously we'll use the RSS feed if it's available, so yes, we're using it. I'm saying that for the Articles Page, which doesn't have one, we do the HTML parsing.
Ah right ok so:
Closing since this seems to confront the main issue. Feel free to keep replying if you have any other important things to add.
There are other possible alternatives that should be discussed in terms of viability/feasibility. Allow me to be the devil's advocate in order to provoke thought.
Questions that should be asked/answered:
Our solution should depend on answers to the above questions.
@SantiVargas
It's also worth noting that all this indirectly has the web app act as a sort of central endpoint for accessing all of this data. Perhaps in the future this can be used as a REST API for native apps. That way the native apps would get their data from something we control directly, so we can keep the output format consistent even if the root sources (the various Molloy sites) change. If one of those sites changes, we just change our implementation server-side but still send the same data format to the mobile apps. Just a thought, not really related to the problem at hand however.
Ok. If this is the best route for now, then there are many node modules that can help us read the HTML and convert it to other formats if needed.
Regarding coupling and the changing HTML DOM structure for the Articles page potentially breaking the App:
Given such a change, we humans would (hopefully) still be able to make sense of the structure by viewing the source. I think it would be an interesting programming exercise to make the app smart enough to do the same.
With the current implementation, the HTML processor is in a separate module and a separate processor module decides which processor to use (currently only one). The quick and dirty HTML processor I put together, which is tailored to basically only work for the current format, parses the Articles pages' sources and saves the data to a file in JSON. The Articles page gets its data from this file so if the format changes, we can simply refrain from running the processing script (or even better have it run periodically automatically to get updates, but have it stop itself when it recognizes that it may not be able to process the data correctly). The current separation of modules would seem to encapsulate what may change. For example, if we need to change the HTML processor, code for a hypothetical XML processor remains untouched. Alternatively, if we only want to change the logic for deciding which processor should be used, none of the modules performing the format-specific parsing needs to be touched.
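To make the separation described above concrete, here is a minimal sketch and not the code from the branch. The names `processors`, `chooseProcessor`, and `refreshArticles` are hypothetical, and both processors are deliberately naive regex-based stand-ins. It keeps the format-specific parsing apart from the selection logic, and it holds on to the previously saved articles when the fresh result looks wrong.

```javascript
// Shared helper: collect { title, content } pairs from every regex match.
function collectArticles(pattern, source) {
  const articles = [];
  for (const match of source.matchAll(pattern)) {
    articles.push({ title: match[1].trim(), content: match[2].trim() });
  }
  return articles;
}

// Hypothetical format-specific processors (assumed markup, for illustration).
const processors = {
  // Only recognises <h3>title</h3><p>content</p> pairs.
  html(source) {
    return collectArticles(/<h3>([^<]*)<\/h3>\s*<p>([^<]*)<\/p>/g, source);
  },
  // Only recognises RSS-style <item> elements with a title and description.
  xml(source) {
    return collectArticles(
      /<item>\s*<title>([^<]*)<\/title>\s*<description>([^<]*)<\/description>\s*<\/item>/g,
      source
    );
  }
};

// Selection logic lives apart from the parsing logic.
function chooseProcessor(source) {
  return /^\s*<\?xml|<rss[\s>]/.test(source) ? processors.xml : processors.html;
}

// Returns freshly parsed articles, or the previous ones if the result
// suggests the source format has changed.
function refreshArticles(source, previousArticles) {
  const parsed = chooseProcessor(source)(source);
  const looksValid = Array.isArray(parsed) && parsed.length > 0 &&
    parsed.every(article => article.title && article.content);
  return looksValid ? parsed : previousArticles;
}
```

With this shape, changing the HTML processor leaves the XML one untouched, and a periodic job could call `refreshArticles` without risking overwriting good data with an empty result.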
Regarding using an existing node module to do the format conversion:
I think it would be more impressive to a potential employer, and more rewarding, if you all can implement this functionality yourselves. It is certainly true that it's often better to use heavily tested code written by experienced software engineers when working in industry, but as a programming project for developing and promoting your skill sets, I would definitely recommend the do-it-yourself approach. The processor I made is quite lousy for a number of reasons. I wanted to leave making a robust processor that would, e.g., fully construct the DOM to you all, if you're interested in doing so.
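As a starting point for that exercise, here is a rough sketch of building a tree from an HTML string. It is an illustration under simplifying assumptions, not a spec-compliant parser: it ignores comments, doctypes, `<script>` contents, entities, and unquoted attributes, and `buildTree`, `findAll`, and `textOf` are hypothetical names.

```javascript
// Elements that never have a closing tag (partial list, for illustration).
const VOID_TAGS = new Set(['br', 'hr', 'img', 'input', 'link', 'meta']);

function buildTree(html) {
  const root = { tag: '#root', attrs: {}, children: [] };
  const stack = [root]; // currently open elements, innermost last
  // Each token is either a tag (<...>) or a run of text between tags.
  for (const [token] of html.matchAll(/<\/?[^>]+>|[^<]+/g)) {
    const parent = stack[stack.length - 1];
    if (!token.startsWith('<')) {
      const text = token.trim();
      if (text) parent.children.push({ text });
    } else if (token.startsWith('<!')) {
      continue; // skip doctype and (simple) comments
    } else if (token.startsWith('</')) {
      // Closing tag: pop back to the matching open element, if there is one.
      const name = token.slice(2, -1).trim().toLowerCase();
      const index = stack.map(node => node.tag).lastIndexOf(name);
      if (index > 0) stack.length = index;
    } else {
      const parts = token.match(/^<([^\s/>]+)([^>]*)>$/);
      if (!parts) continue; // malformed tag, ignore it
      const attrs = {};
      for (const [, key, value] of parts[2].matchAll(/([\w-]+)="([^"]*)"/g)) {
        attrs[key] = value;
      }
      const node = { tag: parts[1].toLowerCase(), attrs, children: [] };
      parent.children.push(node);
      const selfClosing = parts[2].trim().endsWith('/');
      if (!VOID_TAGS.has(node.tag) && !selfClosing) stack.push(node);
    }
  }
  return root;
}

// Collect every descendant element with the given tag name.
function findAll(node, tag, found = []) {
  for (const child of node.children || []) {
    if (child.tag === tag) found.push(child);
    findAll(child, tag, found);
  }
  return found;
}

// Concatenate the text content beneath a node.
function textOf(node) {
  if (node.text !== undefined) return node.text;
  return node.children.map(textOf).join(' ');
}
```

Once the tree exists, extracting data becomes a matter of walking it, for example `findAll(tree, 'h3').map(textOf)` for headlines, and no longer depends on matching the exact character layout of the page.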
Reopening since discussion seems to still be going on.
@jschanker Took a look at your branch. I like the idea of having an abstract processor so that we can easily swap out implementations in the future without touching the rest of the code. This should be sufficient for now.
Where you are creating a JSON file for storing the articles, I think we might be able to use a caching system instead. I've used node-cache (https://www.npmjs.com/package/node-cache) in projects before, so it may be a suitable option. Generally, if we can avoid going to the filesystem, we should. A cache of course does not persist between server restarts, but I don't think that would be an issue.
As for the template, perhaps using EJS to render it server-side would be a better option than the client-side script you have in the articles.ejs template. EJS (Embedded JavaScript) can pretty much be used to pre-render templates in a similar way to PHP.
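To illustrate the caching idea without tying it to a particular package, here is a small hand-rolled in-memory cache with expiry. This is not node-cache's API; `createCache` and `getArticles` are hypothetical names, and the injectable `now` clock exists only to make the sketch easy to test.

```javascript
// Minimal in-memory cache whose entries expire after ttlMs milliseconds.
function createCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    set(key, value) {
      entries.set(key, { value, expires: now() + ttlMs });
    },
    get(key) {
      const entry = entries.get(key);
      if (!entry) return undefined;
      if (now() >= entry.expires) {
        entries.delete(key); // stale: drop it so the caller reloads
        return undefined;
      }
      return entry.value;
    }
  };
}

// Serve articles from the cache, calling loadArticles only on a miss.
function getArticles(cache, loadArticles) {
  let articles = cache.get('articles');
  if (articles === undefined) {
    articles = loadArticles();
    cache.set('articles', articles);
  }
  return articles;
}
```

Here `loadArticles` stands in for whatever fetches and processes the source page, so the expensive work happens at most once per expiry window and nothing touches the filesystem.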
Example I just threw together, may not actually work but basic concept should be illustrated:
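<%# forEach yields each article object; for...in would yield only the array indices %>
<% articles.forEach(function (article) { %>
<h3><%= article.title %></h3>
<% if (article.image) { %>
<img src="<%= article.image %>">
<% } %>
<p><%= article.content %></p>
<% }); %>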
Anything between <% %> will be run as JavaScript, server-side. Data between <%= %> will render the value of the data inside, such as variables.
The data a template can use is passed as an object to res.render() in our Express router. For example, if we were using the code above as a template we might say
res.render('articles', {articles: arrayOfArticleObjects});
Here the first argument is the name of the template (the articles.ejs file) and the second is the data object.
Let me know what you think.
@jcdesimp I agree with you on both points.
Also, just to elaborate a little on what I wrote before, the code that I introduced should definitely be cleaned up at some point. For example, in addition to the processResponseString function being heavily reliant on the current format of the Articles pages to work, it also suffers from quite a bit of code duplication. Regarding the processing capability of the script, there are a number of ideas I have to make it much more adaptive to format changes. For example, if some of the previous content were to remain during a hypothetical structure change, we could have our processing program identify any new markup for headlines, dates, article previews, etc. by locating the HTML elements with the corresponding content.
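The "locate the element by its known content" idea could look roughly like the sketch below. It assumes a headline string from the previously processed data still appears verbatim as the entire content of some element in the new markup. `findWrapper` is a hypothetical helper, and the regex approach is a simplification that does not handle nested markup inside the element.

```javascript
// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns { tag, className } for the element that directly wraps knownText,
// or null if knownText is not the whole content of any element.
function findWrapper(html, knownText) {
  const pattern = new RegExp(
    '<([a-zA-Z][\\w-]*)([^>]*)>\\s*' + escapeRegExp(knownText) + '\\s*</\\1>'
  );
  const match = html.match(pattern);
  if (!match) return null;
  const classMatch = match[2].match(/class="([^"]*)"/);
  return {
    tag: match[1].toLowerCase(),
    className: classMatch ? classMatch[1] : null
  };
}
```

If a headline that used to sit in one element now turns up in, say, a differently named tag with a new class, the returned tag and class tell the processor which markup to look for when extracting the remaining headlines.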
I think it would be a good idea to get the data by processing the HTML from http://www.molloy.edu/molloy-life/molloy-life-news . Zach was telling me that the consensus was that this approach was hack-y, but to my knowledge, search engine bots process data like this all the time, relying on proper markup by web page creators. I don't think it would be too bad to go with this approach. If nothing else, I think it could be a good academic exercise to construct the DOM tree and extract the relevant data. Thoughts?