wikimedia / sentencex

A sentence segmentation library with wide language support optimized for speed and utility.
https://wikimedia.github.io/sentencex/
MIT License

port the library to js #1

Closed. santhoshtr closed this issue 11 months ago

santhoshtr commented 12 months ago

Even though Pyodide can run the demo in the browser, I don't think it is a replacement for a true JS library for this project. I tried running the Node program below to do segmentation. It works, but it is slow: initialization takes around 10 seconds and involves installing the sentencex library.

'use strict';

const { loadPyodide } = require( 'pyodide' );

async function initSentencex() {
    const pyodide = await loadPyodide();
    await pyodide.loadPackage( 'micropip' );
    const micropip = pyodide.pyimport( 'micropip' );
    await micropip.install( 'sentencex' );
    return pyodide;
}

const segment = ( language, text ) => {
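    // Expose the arguments on the global object so the Python code
    // can read them through `from js import language, text`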
    global.language = language;
    global.text = text;
    const segmenterPy = `
        from sentencex import segment
        from js import language, text
        list(segment(language, text))
    `;
    return global.pyodideInstance.runPython( segmenterPy ).toJs();
};

initSentencex().then( ( pyodideInstance ) => {
    global.pyodideInstance = pyodideInstance;
    const text = 'The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb\'s design and development.';
    const sentences = segment( 'en', text );
    console.log( sentences );
} );

So it would be ideal to port this to an actual JS library. Since the project has no dependencies, just some regular expressions and language-dependent data, I believe the port should be easy.
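
To illustrate what "regular expressions plus language-dependent data" could look like in plain JS, here is a rough sketch. It is not the sentencex algorithm and not the API of any port: the names `ABBREVIATIONS`, `BOUNDARY` and `segmentSketch` are made up for this example, and the abbreviation list is a tiny hypothetical sample.

```javascript
'use strict';

// Hypothetical per-language data: abbreviations that end in a period
// without ending a sentence. Real data would be much larger.
const ABBREVIATIONS = {
    en: new Set( [ 'dr', 'mr', 'mrs', 'u.s', 'e.g', 'i.e' ] )
};

// Candidate boundary: terminal punctuation followed by whitespace.
const BOUNDARY = /[.!?]+\s+/g;

function segmentSketch( language, text ) {
    const abbreviations = ABBREVIATIONS[ language ] || new Set();
    const sentences = [];
    let start = 0;
    for ( const match of text.matchAll( BOUNDARY ) ) {
        const end = match.index + match[ 0 ].length;
        // Last word before the punctuation, lowercased for lookup
        const lastWord = text.slice( start, match.index )
            .split( /\s+/ ).pop().toLowerCase();
        if ( abbreviations.has( lastWord ) ) {
            // Known abbreviation, so this is not a sentence end
            continue;
        }
        // Keep trailing whitespace so the segments join back to the input
        sentences.push( text.slice( start, end ) );
        start = end;
    }
    if ( start < text.length ) {
        sentences.push( text.slice( start ) );
    }
    return sentences;
}

// Yields two segments; the period in "U.S." does not cause a split.
console.log( segmentSketch( 'en',
    'The U.S. National Aeronautics and Space Administration led the design. It launched in 2021.' ) );
```

No runtime initialization or package install is needed here, which is the main attraction compared to the Pyodide route. A real port would need far more per-language rules than this.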

santhoshtr commented 11 months ago

Ported. https://github.com/santhoshtr/sentencex-js