winkjs / wink-eng-lite-web-model

English lite language model for Web Browsers
MIT License

Replace usage of NodeJS buffer in web-model #5

Closed: mrbrianevans closed this issue 2 years ago

mrbrianevans commented 2 years ago

In the file https://github.com/winkjs/wink-eng-lite-web-model/blob/master/dist/read-core-model.js#L1 there is a usage of the Buffer API, which is only available in Node.js. This causes an error when the model is used in the browser (mentioned in https://github.com/winkjs/wink-nlp/issues/67). The precise operation causing the error is loading the base64-encoded strings (the lexicon and expansions) from the model JSON file into a Uint32Array. This is done using Buffer.from(), which is not natively available in web browsers.

In other issues/discussions you have suggested using browserify to work around this, but there are many situations where that is not practical, for example when another build tool is already in use for a project. Another option is to import from 'https://cdn.skypack.dev/wink-eng-lite-web-model', but this also does not fit with build tools that bundle vendor dependencies at compile time: it requires every user to load the module from a CDN, and it does not support tree shaking (https://github.com/skypackjs/skypack-cdn/issues/91).
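
For reference, the failing call is roughly of this shape (just an illustrative sketch; the exact code in read-core-model.js may differ):

// Node.js only: Buffer is not defined in browsers, so this line throws there.
const lexiconBytes = Buffer.from(model.lexicon, 'base64')
// The decoded bytes are then reinterpreted as 32-bit words.
const lexicon = new Uint32Array(lexiconBytes.buffer, lexiconBytes.byteOffset, lexiconBytes.byteLength / 4)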

I propose that the Node.js-specific code be modified to be browser compatible.

I have looked at the code and tried to come up with a solution using atob to decode the base64 string. I have not opened a pull request because, as far as I can tell, the code published on GitHub is not the source code but rather a minified build output, and I have not been able to find the source code (if it is hosted publicly).

const originalModel = require('./languages/cur/models/eng-core-web-model.json')

// Intended browser replacement for Buffer.from(b64encoded, 'base64'):
// decode the base64 string with atob() and build a Uint32Array from the
// individual character codes.
function u32FromStr(b64encoded) {
  return new Uint32Array(
    atob(b64encoded)
      .split('')
      .map(function (c) {
        return c.charCodeAt(0)
      })
  )
}

const readModel = function () {
  // Deep-clone the bundled JSON so the original module export stays untouched.
  const model = JSON.parse(JSON.stringify(originalModel))
  const packing = model.packing
  const featuresData = model.features
  // Replace the base64-encoded lexicon and expansions with typed arrays.
  model.lexicon = u32FromStr(model.lexicon)
  model.xpansions = u32FromStr(model.xpansions)
  const pos = model.pos
  // For features whose packing layout's 4th entry is 0, rebuild a
  // value-to-index hash from the feature's list.
  for (const f in model.packing.layout)
    if (0 === packing.layout[f][3]) {
      featuresData[f].hash = Object.create(null)
      for (let k = 0; k < featuresData[f].list.length; k += 1)
        featuresData[f].hash[featuresData[f].list[k]] = k
    }
  featuresData.lexeme.hash = Object.create(null)
  for (let k = 0; k < featuresData.lexeme.list.length; k += 1)
    featuresData.lexeme.hash[featuresData.lexeme.list[k]] = k
  // Convert underscore-separated POS cluster strings into sets of POS ids.
  const clusters = featuresData.posClusters.list
  for (let k = 0; k < clusters.length; k += 1)
    clusters[k] = new Set(clusters[k].split('_').map((e) => pos.hash[e] || 0))
  return model
}

module.exports = readModel

Credit to https://stackoverflow.com/questions/12710001/how-to-convert-uint8-array-to-base64-encoded-string#12713326 for the base64 converter.

I have tested this code locally and it's not quite right. I think it might be better to use a TextEncoder to do the base64 conversion.
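
My guess as to why it misbehaves (only a guess, assuming the model expects the decoded bytes reinterpreted as 32-bit words): new Uint32Array(arrayOfCharCodes) widens each byte into its own 32-bit element instead of packing four bytes per element. A rough, untested sketch of what I mean, decoding to a Uint8Array first and then viewing its underlying buffer:

// Hypothetical variant, not tested against the real model file.
function u32FromBase64(b64encoded) {
  const decoded = atob(b64encoded)      // base64 -> binary string
  const bytes = new Uint8Array(decoded.length)
  for (let i = 0; i < decoded.length; i += 1) {
    bytes[i] = decoded.charCodeAt(i)    // one byte per character
  }
  // Reinterpret the same memory as 32-bit words (4 bytes per element),
  // which is what the Node build gets via Buffer.from(b64encoded, 'base64').
  return new Uint32Array(bytes.buffer, 0, Math.floor(bytes.length / 4))
}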

Could you please consider changing the implementation so that it is directly compatible with browsers, without requiring a polyfill?

sanjayaksaxena commented 2 years ago

Thank you so much @mrbrianevans for such a detailed issue. The release of the code has been long pending at our end.

As a first step, we will aim to release the code in the next couple of days so that you can send the PR using TextEncoder, which we will also review in the meantime.

sanjayaksaxena commented 2 years ago

I have tried a couple of options, including TextEncoder; the following seems to work:

const bufferFromBase64 = function ( data ) {
  // Decode base64 to a binary string, then copy each character's code
  // into a Uint8Array, one byte per character.
  const decodedData = atob( data );
  const size = decodedData.length;
  const bytes = new Uint8Array( size );
  for ( let k = 0; k < size; k += 1 ) {
      bytes[ k ] = decodedData.charCodeAt( k );
  }

  return bytes;
}; // bufferFromBase64()

The above function is equivalent to Buffer.from( data, 'base64' ).
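
A quick way to sanity-check the equivalence (sketch only; it assumes an environment where both Buffer and atob are available, e.g. a recent Node version, and uses made-up sample data rather than the model file):

const sample = Buffer.from( 'hello wink' ).toString( 'base64' );
const fromBuffer = new Uint8Array( Buffer.from( sample, 'base64' ) );
const fromAtob = bufferFromBase64( sample );
// Both should produce the same byte sequence.
console.log( fromBuffer.length === fromAtob.length &&
             fromBuffer.every( ( b, i ) => b === fromAtob[ i ] ) ); // true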

Need to perform detailed testing.

What is your opinion @mrbrianevans?

mrbrianevans commented 2 years ago

Thank you very much for publishing the source code.

I will try the function you've put above in the source code and see if I can get it to work. If I do, I'll submit a PR.

mrbrianevans commented 2 years ago

Okay, I've submitted PR #6.

mrbrianevans commented 2 years ago

Closed by #6

sanjayaksaxena commented 2 years ago

Released version 1.3.3 on npm 👍