weaviate / how-to-ingest-pdfs-with-unstructured

https://github.com/weaviate-tutorials/how-to-ingest-pdfs-with-unstructured

Reducing the prompt #2

Closed Kate-Lyndegaard closed 1 year ago

Kate-Lyndegaard commented 1 year ago

Hi,

Thank you for providing this tutorial.

I am trying to run the code, which I adapted to use with my data.

Example:


import re

# Matches element text that is either a single capital letter or a dotted
# section number like "1", "1.2", "1.2.3" (re.match anchors at the start,
# and the trailing $ anchors at the end, so the whole text must match).
section_pattern = re.compile(r'[A-Z]$|[0-9]+(\.[0-9]+)*$')

class AbstractExtractor:
    def __init__(self):
        self.current_section = None  # Keep track of the current section being processed
        self.current_entry = None  # Keep track of the current entry being processed
        self.have_extracted_abstract = (
            False  # Keep track of whether the abstract has been extracted
        )
        self.in_abstract_section = (
            False  # Keep track of whether we're inside the Abstract section
        )
        self.texts = []  # Keep track of the extracted abstract text

    def process(self, element):
        found = re.match(section_pattern, element.text)
        if found:
            # New section heading: flush the previous entry and start a new one
            if self.current_entry is not None:
                self.texts.append(self.current_entry)
            self.current_section = "XPlanung" + found.group().replace('.', '_')
            self.current_entry = {self.current_section: []}
        elif self.current_section is not None:
            # Regular text: collect it under the current section
            self.current_entry[self.current_section].append(element.text)

        return True
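
For reference, this is roughly how the extractor is driven, following the tutorial's use of unstructured's partition_pdf (the file name below is just a placeholder for one of my PDFs):

from unstructured.partition.pdf import partition_pdf

# placeholder path -- replace with one of the PDFs from testdata.zip
elements = partition_pdf(filename="testdata/example.pdf")

extractor = AbstractExtractor()
for element in elements:
    extractor.process(element)

# extractor.texts now holds one dict per section, mapping the section key
# to the list of text snippets collected under it. Note that the last
# section is only flushed when a later heading is seen, so it may still
# be sitting in extractor.current_entry at this point.
data = extractor.texts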

My Example PDFs:

testdata.zip

I receive the error: "update vector: connection to: OpenAI API failed with status: 400 error: This model's maximum context length is 8191 tokens, however you requested 9657 tokens (9657 in your prompt; 0 for the completion). Please reduce your prompt; or completion length." I am using the module config from your example.

Do you have any recommendations on how to reduce the prompt, or update my schema, so that this works?

Kind regards, Kate

hsm207 commented 1 year ago

hey @Kate-Lyndegaard!

the error means the text snippet you are trying to embed is too big for the OpenAI model to handle. you should chunk the text snippets so that each one fits within the model's maximum input tokens.
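
for example, something like this (just a rough sketch using tiktoken; the encoding name and token limit are assumptions based on text-embedding-ada-002, so adjust them to whatever embedding model you configured):

import tiktoken

# split a long text into pieces that each fit the embedding model's token
# limit; 8191 and cl100k_base are assumptions for text-embedding-ada-002
MAX_TOKENS = 8191
encoding = tiktoken.get_encoding("cl100k_base")

def chunk_text(text, max_tokens=MAX_TOKENS):
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]

you would then store each chunk as its own value instead of one huge snippet, so every object you send to Weaviate stays under the limit.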

Kate-Lyndegaard commented 1 year ago

Hi @hsm207 ,

OK, I will try doing that. I found this notebook which explains how to do it: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb

Kind regards, Kate