yawik / SimpleImport

Simple Job Import Module. Imports job openings into YAWIK
MIT License

Do not fetch remote content, if provided via source #10

Closed. mbo-s closed this issue 6 years ago

mbo-s commented 6 years ago

At the moment, the plain text for a remote job is always fetched with a remote GET request:

https://github.com/yawik/SimpleImport/blob/master/src/CrawlerProcessor/JobProcessor.php#L175-L187

This does not work if the remote site loads the job content via JavaScript or uses an iframe.

If the remote data contains the needed templateValues (see http://scrapy-docs.yawik.org/build/html/guidelines/format.html), use them; otherwise fall back to the remote fetch.

"templateValues": {
  "description": "<p>We're a good company<\/p>",
  "tasks": "<b>Your Tasks<\/b><ul><li>Task 1<\/li><li>Task2<\/li><\/ul>",
  "requirements": "<b>Qualifications<\/b><ul><li>requirement 1<\/li><li>requirement 2<\/li><\/ul>",
  "benefits": "<b>We offer<\/b><ul><li>offer 1<\/li><li>offer 2<\/li><\/ul>",
  "html": "<p>complete html<\/p>"
}

Something like:

$data = $importData['templateValues'] ?? [];
// Concatenate the individual sections; missing keys count as empty.
$sections = ($data['description'] ?? '') . ($data['tasks'] ?? '')
    . ($data['requirements'] ?? '') . ($data['benefits'] ?? '');
if (!empty($data['html'])) {
    $plainText = prettify($data['html']);
} elseif ($sections !== '') {
    $plainText = prettify($sections);
} else {
    $plainText = remotefetch($url);
}

Here prettify($html) should remove all HTML tags.
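The proposal above could be sketched roughly as follows. This is only an illustration, not the module's actual code: `prettify()` and `resolvePlainText()` are hypothetical names, the remote fetch is passed in as a callable so the sketch does not depend on any particular HTTP client, and the list of block-level tags that get a line break is an assumption.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to readable plain text.
function prettify(string $html): string
{
    // Insert a line break before block-level tags so that paragraphs and
    // list items do not run together once the tags are stripped.
    $html = preg_replace('#</?(?:p|li|ul|ol|div|h[1-6]|br)\b[^>]*>#i', "\n" . '$0', $html);
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of spaces and tabs, trim every line, drop empty lines.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    $lines = array_map('trim', explode("\n", $text));
    $lines = array_filter($lines, static fn(string $line): bool => $line !== '');
    return implode("\n", $lines);
}

// Hypothetical helper: prefer the provided templateValues over a remote fetch.
// $remoteFetch is any callable that takes a URL and returns plain text.
function resolvePlainText(array $templateValues, string $url, callable $remoteFetch): string
{
    // 1. A complete "html" value wins.
    if (!empty($templateValues['html'])) {
        return prettify($templateValues['html']);
    }
    // 2. Otherwise join the individual sections, if any of them is set.
    $sections = '';
    foreach (['description', 'tasks', 'requirements', 'benefits'] as $key) {
        $sections .= $templateValues[$key] ?? '';
    }
    if (trim($sections) !== '') {
        return prettify($sections);
    }
    // 3. Nothing was provided by the source: fall back to the remote GET.
    return $remoteFetch($url);
}
```

With this shape, the existing remote GET would only be invoked in the third case, so sources that already deliver the content (including JavaScript- or iframe-based sites scraped elsewhere) never trigger a fetch.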

cbleek commented 6 years ago

Hi @fedys,

do you have time to take a look at this issue?

fedys commented 6 years ago

Hi @cbleek,

Sadly, I don't. I am too busy these days.