yaqwsx / jlcparts

Better parametric search for components available for JLC PCB assembly
https://yaqwsx.github.io/jlcparts/
MIT License
562 stars, 51 forks

Scrape datasheets for keywords #17

Open yaqwsx opened 3 years ago

yaqwsx commented 3 years ago

We could download datasheets and scrape chip description from them. However, it is unclear how to perform the extraction. We could use pdftotext and use some heuristics:
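A minimal sketch of the pdftotext route, assuming the Poppler `pdftotext` CLI is installed. The heuristic shown (take the paragraph that follows a "Description" heading) is only a hypothetical illustration, not one of the heuristics proposed in this thread, and the function names are made up for the example.

```python
import re
import subprocess
from typing import Optional


def pdf_to_text(pdf_path: str, last_page: int = 2) -> str:
    """Extract text from the first pages of a PDF with the pdftotext CLI."""
    # "-l N" stops after page N; the trailing "-" writes the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-l", str(last_page), "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical heuristic: a line consisting only of "Description" or
# "General Description" (optionally followed by a colon) marks the summary.
HEADING = re.compile(
    r"^[ \t]*(?:general[ \t]+)?description[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def description_after_heading(text: str) -> Optional[str]:
    """Return the first paragraph after a description heading, or None."""
    match = HEADING.search(text)
    if match is None:
        return None
    rest = text[match.end():].lstrip("\n")
    paragraph = rest.split("\n\n", 1)[0]
    collapsed = " ".join(paragraph.split())  # undo line wrapping
    return collapsed or None
```

Usage would be along the lines of `description_after_heading(pdf_to_text("datasheet.pdf"))`; datasheets without such a heading simply yield None and would need a different heuristic.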

yaqwsx commented 3 years ago

We could also use some summarizers as a default option, e.g:

Architecture proposal: extractDescription(component, datasheetPdf) - this function decides how to extract the description and uses one of the approaches above. It returns a tuple (short name/description, description text). The short description will be shown in the table. It can also return None to indicate that the component has no description (e.g., passive components).
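A rough sketch of how such a dispatcher could look in Python. Everything beyond the proposed signature and return type is an assumption: the snake_case name, the extra `extractors` parameter, the `category` key on the component, and the set of passive categories are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# (short name/description, description text), as in the proposal.
Description = Tuple[str, str]
Extractor = Callable[[dict, bytes], Optional[Description]]

# Hypothetical category names; the real database may label these differently.
PASSIVE_CATEGORIES = {"Resistors", "Capacitors", "Inductors"}


def extract_description(
    component: dict,
    datasheet_pdf: bytes,
    extractors: List[Extractor],
) -> Optional[Description]:
    """Try each extraction approach in order and return the first result."""
    # Passive components get no description at all.
    if component.get("category") in PASSIVE_CATEGORIES:
        return None
    for extractor in extractors:
        result = extractor(component, datasheet_pdf)
        if result is not None:
            return result
    # No approach produced a description.
    return None
```

Each approach (pdftotext heuristics, a summarizer, ...) would then be a plain function with the `Extractor` signature, so the order of the list decides which approach is preferred.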

Note that tools like multi-rake and similar work really badly.

yaqwsx commented 11 months ago

ChatGPT excels at this and yields high-quality summaries. It also manages to extract information from Chinese datasheets and can provide keywords and descriptions in English. The proposal is to store such descriptions in a database for full-text search. We can generate both "keywords with parameters for full-text search" and "a single-sentence description" to show on the results page, which would help the user recognize the part quickly.
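To illustrate the storage idea, here is a small sketch using SQLite's FTS5 extension from Python. It is an assumption that SQLite would be the backing store at all, and that the local SQLite build ships with FTS5; the table layout and the part numbers are placeholders.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_index(rows: Iterable[Tuple[str, str, str]]) -> sqlite3.Connection:
    """Create an in-memory full-text index of (part number, summary, keywords)."""
    db = sqlite3.connect(":memory:")
    # The part number is stored but not indexed; summary and keywords are searchable.
    db.execute(
        "CREATE VIRTUAL TABLE descriptions "
        "USING fts5(part UNINDEXED, summary, keywords)"
    )
    db.executemany("INSERT INTO descriptions VALUES (?, ?, ?)", rows)
    return db


def search(db: sqlite3.Connection, query: str) -> List[Tuple[str, str]]:
    """Return (part number, summary) pairs matching the query, best match first."""
    cursor = db.execute(
        "SELECT part, summary FROM descriptions "
        "WHERE descriptions MATCH ? ORDER BY rank",
        (query,),
    )
    return cursor.fetchall()


# Placeholder part numbers; the texts are shortened from the examples in this thread.
db = build_index([
    ("PART-0001", "RGB pixel LED with 2-wire SPI interface", "HD108 5050 16-bit 40MHz"),
    ("PART-0002", "Microstepping stepper motor driver", "HR8826 bipolar 38V 3A"),
])
```

The single-sentence description goes into `summary` for display, while `keywords` only feeds the search. A query such as `search(db, "stepper driver")` requires both words to appear in a row.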

The problem, however, boils down to time and price. Depending on the datasheet size, extraction costs between 0.01 and 0.15 USD per datasheet, as experiments on a dozen datasheets have shown. It takes between 1 and 20 seconds to process a single datasheet. This means that the total cost to generate these summaries for the whole database of only active components would be approximately:
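The arithmetic behind such an estimate can be sketched as follows. The per-datasheet ranges come from the experiments mentioned above; the number of datasheets is left as a parameter, and the 1000 used in the example call is purely a placeholder, not the size of the actual database.

```python
from typing import Dict, Tuple


def estimate_totals(
    n_datasheets: int,
    cost_range_usd: Tuple[float, float] = (0.01, 0.15),
    seconds_range: Tuple[float, float] = (1.0, 20.0),
) -> Dict[str, Tuple[float, float]]:
    """Scale the per-datasheet cost and time ranges to a whole batch."""
    low_cost, high_cost = (n_datasheets * c for c in cost_range_usd)
    # Hours if the datasheets are processed one after another.
    low_hours, high_hours = (n_datasheets * s / 3600 for s in seconds_range)
    return {
        "cost_usd": (low_cost, high_cost),
        "hours_sequential": (low_hours, high_hours),
    }


# Placeholder batch size, for illustration only.
totals = estimate_totals(1000)
print(totals)
```

Running requests in parallel would reduce the wall-clock time but not the price, so the cost range is the figure that matters for a hobby project.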

At the moment, this would be a really cool feature, just too expensive for a hobby project. And if we ever dive into this, we should spend more time tweaking the prompt to get the best results.

Some examples: for the HD108 datasheet, it outputs:

HD108 RGB Pixel LED chip, 5050 package, 16-bit grayscale (65536 levels), Clock Drive Frequency 40MHz. Features: CMOS process, low voltage, low power consumption, high brightness efficiency, pure copper bracket, pure gold wire, two-line synchronization, energy-saving mode. Absolute Maximum Ratings: Power Supply Voltage -0.5 to +5.5V, Logic Input Voltage -0.3 to VDD+0.3V, Working Temperature -20 to +80°C, Storage Temperature -50 to +120°C, ESD Pressure 4kV. Electrical Parameters: Supply voltage 5.0V (max 5.3V), LED current 20mA, clock high/low level width >30ns, data setup time >10ns, PWM frequency >27kHz, static power consumption 0mA, max transfer speed 40MHz.

When instructed to give a human-readable summary, we get:

The HD108 LED is a full color digitally controlled LED light source that combines a control circuit with an RGB LED chip to provide optimal performance for RGB digital dimming control. It has a 2-wire SPI interface and is fully compatible with APA102 LEDs. The HD108 operates at a 40MHz oscillator frequency enabling high-speed data transmission and a PWM refresh rate over 27kHz. It can adjust emitting color with 65,536 gray scale levels (controlled by 16 bits) and 32 brightness adjustment levels (controlled by 5 bits), and does not require decoupling capacitors at the power input. Available in a 6-lead 5050 package, it includes an intelligent digital port data latch, signal reshaping amplification drive circuit, and a programmable constant current driver, ensuring consistent pixel point light color.

When given a Chinese datasheet for HR8826:

HR8826 microstepping stepper motor driver; bipolar motors; full/half/quarter/eighth/sixteenth/thirty-second step modes; 38V, ±3A output drive capability; programmable decay modes; synchronous rectification; over-current, over-temperature protection; no special power-on sequencing required; TSSOP-28 package with exposed thermal pad; lead-free; 100% matte tin plating; supports mixed decay mode; 8-38V power supply; built-in 3.3V reference output; up to 32 microstep resolutions supported; internal undervoltage lockout; thermal shutdown circuit; operational limits: load voltage 8-38V, output current ±3A, logic input voltage -0.3 to 7V, sense voltage -0.5 to 0.8V, reference voltage -0.3 to 4V, ambient temperature -20 to +85°C, storage temperature -55 to +150°C; motor power supply voltage 8-38V; recommended operating conditions nominal at 8-38V with max 3.5V REF input, 1mA V3P3OUT load current; electrical characteristics include RDS(ON) 180 mΩ, PWM frequency below 50kHz.