weaviate / contextionary

Weaviate's own language vectorizer, which allows for semantic context-based searches in Weaviate
https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-contextionary
BSD 3-Clause "New" or "Revised" License

panic: runtime error: slice bounds out of range #64

Open nikhilweee opened 1 year ago

nikhilweee commented 1 year ago

Use the following docker-compose.yml to spin up Weaviate and the contextionary.

```yml
# docker-compose.yml
---
version: "3.4"
services:
  weaviate:
    command:
      - --host
      - 0.0.0.0
      - --port
      - "8080"
      - --scheme
      - http
    image: semitechnologies/weaviate:1.20.3
    ports:
      - 8080:8080
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      CONTEXTIONARY_URL: contextionary:9999
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      DEFAULT_VECTORIZER_MODULE: "text2vec-contextionary"
      ENABLE_MODULES: "text2vec-contextionary"
      CLUSTER_HOSTNAME: "node1"
  contextionary:
    environment:
      OCCURRENCE_WEIGHT_LINEAR_FACTOR: 0.75
      EXTENSIONS_STORAGE_MODE: weaviate
      EXTENSIONS_STORAGE_ORIGIN: http://weaviate:8080
      NEIGHBOR_OCCURRENCE_IGNORE_PERCENTILE: 5
      ENABLE_COMPOUND_SPLITTING: "false"
    image: semitechnologies/contextionary:en0.16.0-v1.0.2
    ports:
      - 9999:9999
```

Then run the following Python script, which tries to import the first 6000 articles.

# import.py
import pandas as pd
import weaviate

client = weaviate.Client(
    url="http://localhost:8080",
)
class_obj = {
    "class": "Article",
    "moduleConfig": {"text2vec-contextionary": {"vectorizeClassName": "false"}},
}
client.schema.create_class(class_obj)

# https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip
df = pd.read_csv("../vector_database_wikipedia_articles_embedded.csv")
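# Import the first 6000 rows; objects are sent to Weaviate in batches of 1280.
with client.batch(batch_size=1280) as batch:
    for idx, row in df.head(6000).iterrows():
        properties = {
            "title": row.title,
            "text": row.text,
        }
        # Use the batch handle returned by the context manager.
        batch.add_data_object(properties, "Article")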

The import fails and the contextionary container panics with the log below. The error does not occur if you use df.head(5120) instead.

$ docker-compose up
Starting weaviate_contextionary_1 ... done
Starting weaviate_weaviate_1      ... done
Attaching to weaviate_contextionary_1, weaviate_weaviate_1
contextionary_1  | panic: runtime error: slice bounds out of range [91160663:22935908]
contextionary_1  | 
contextionary_1  | goroutine 31858 [running]:
contextionary_1  | github.com/semi-technologies/contextionary/contextionary/core.(*Wordlist).getWordPtr(0xc000094c60, 0xc7b98, 0x20, 0x7fcfd37fe95c, 0x7)
contextionary_1  |      /app/contextionary/core/wordlist.go:140 +0x88
contextionary_1  | github.com/semi-technologies/contextionary/contextionary/core.(*Wordlist).FindIndexByWord(0xc000094c60, 0xc00c8a6828, 0x8, 0xc00b753c20)
contextionary_1  |      /app/contextionary/core/wordlist.go:109 +0x169
contextionary_1  | github.com/semi-technologies/contextionary/contextionary/core.(*mmappedIndex).WordToItemIndex(0xc000092000, 0xc00c8a6828, 0x8, 0x0)
contextionary_1  |      /app/contextionary/core/mmapped.go:42 +0x42
contextionary_1  | main.(*Vectorizer).vectorForLibraryWord(0xc000148000, 0xc00c8a6828, 0x8, 0x0, 0x0, 0x0)
contextionary_1  |      /app/server/corpus_vectorizer.go:282 +0x161
contextionary_1  | main.(*Vectorizer).VectorForWord(0xc000148000, 0xc00c8a6828, 0x8, 0xaacfb1, 0x1, 0xc00c8a6828)
contextionary_1  |      /app/server/corpus_vectorizer.go:246 +0x2dc
contextionary_1  | main.(*Vectorizer).vectorsAndOccurrences(0xc000148000, 0xc0012d2000, 0x3c6, 0x3c6, 0xc0, 0xc1, 0xc9, 0xcd, 0xd5, 0xd9, ...)
contextionary_1  |      /app/server/corpus_vectorizer.go:200 +0x129
contextionary_1  | main.(*Vectorizer).vectorForWords(0xc000148000, 0xc0012d2000, 0x3c6, 0x3c6, 0xc00b755850, 0xc0012d2000, 0x3c6, 0x3c6)
contextionary_1  |      /app/server/corpus_vectorizer.go:140 +0x67
contextionary_1  | main.(*Vectorizer).vectorForWordOrWords(0xc000148000, 0xc0012d2000, 0x3c6, 0x3c6, 0xc000163850, 0x3c6, 0x46a701, 0x0)
contextionary_1  |      /app/server/corpus_vectorizer.go:127 +0xb8
contextionary_1  | main.(*Vectorizer).Corpi(0xc000148000, 0xc017b6c110, 0x1, 0x1, 0xc00b755850, 0x0, 0xc00fe8c000, 0x0)
contextionary_1  |      /app/server/corpus_vectorizer.go:101 +0x285
contextionary_1  | main.(*server).VectorForCorpi(0xc00016ee00, 0xb76420, 0xc0186c0210, 0xc0021447d0, 0xc00016ee00, 0xc0186c0210, 0xc001cb9a80)
contextionary_1  |      /app/server/api.go:160 +0x89
contextionary_1  | github.com/semi-technologies/contextionary/contextionary._Contextionary_VectorForCorpi_Handler(0xa8c900, 0xc00016ee00, 0xb76420, 0xc0186c0210, 0xc019b36180, 0x0, 0xb76420, 0xc0186c0210, 0xc00c8a2000, 0x169a)
contextionary_1  |      /app/contextionary/contextionary.pb.go:1523 +0x217
contextionary_1  | google.golang.org/grpc.(*Server).processUnaryRPC(0xc0000e8000, 0xb7c040, 0xc00080c780, 0xc00fe8c000, 0xc00016a420, 0xf181d0, 0x0, 0x0, 0x0)
contextionary_1  |      /go/pkg/mod/google.golang.org/grpc@v1.24.0/server.go:995 +0x460
contextionary_1  | google.golang.org/grpc.(*Server).handleStream(0xc0000e8000, 0xb7c040, 0xc00080c780, 0xc00fe8c000, 0x0)
contextionary_1  |      /go/pkg/mod/google.golang.org/grpc@v1.24.0/server.go:1275 +0xd97
contextionary_1  | google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc000a88090, 0xc0000e8000, 0xb7c040, 0xc00080c780, 0xc00fe8c000)
contextionary_1  |      /go/pkg/mod/google.golang.org/grpc@v1.24.0/server.go:710 +0xbb
contextionary_1  | created by google.golang.org/grpc.(*Server).serveStreams.func1
contextionary_1  |      /go/pkg/mod/google.golang.org/grpc@v1.24.0/server.go:708 +0xa1
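
Since 5120 rows import cleanly and 6000 do not, one way to narrow this down might be to bisect on the number of rows. The following is only a rough sketch: `find_first_failing_index` and the `import_fails` callback are hypothetical names, not part of the Weaviate client. The callback is assumed to reset the schema, import the given rows, and return True if the contextionary panicked. It also assumes that a single row is responsible, so that every prefix containing it fails.

```python
from typing import Callable, Optional, Sequence


def find_first_failing_index(
    rows: Sequence[dict],
    import_fails: Callable[[Sequence[dict]], bool],
) -> Optional[int]:
    """Return the index of the first row whose inclusion makes the import fail.

    Assumes the failure is prefix-monotone: if importing rows[:n] fails,
    importing any longer prefix fails as well.
    """
    if not import_fails(rows):
        return None  # the whole set imports cleanly

    # Invariant: importing rows[:lo] succeeds, importing rows[:hi] fails.
    lo, hi = 0, len(rows)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if import_fails(rows[:mid]):
            hi = mid
        else:
            lo = mid
    return hi - 1
```

With 6000 rows this needs about 13 import attempts instead of one per row. Each attempt would have to start from a clean state (fresh `Article` class, restarted contextionary container) for the result to be meaningful.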

nikhilweee commented 1 year ago

I think I figured this out. The following article was causing the error. Everything else works fine if I omit this article.

{
    "index": 5982,
    "title": "Mali",
    "text": "Mali (Bambara:  ߡߊߟߌ, Fula: 𞤃𞤢𞥄𞤤𞤭, ), officially the Republic of Mali ..."
}

Perhaps because the article contains ADLaM and N'Ko characters?
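
To check whether other rows in the dataset contain the same scripts, a small helper along these lines could flag them before import. This is a sketch for testing the hypothesis, not a confirmed diagnosis: `suspect_scripts` and `SUSPECT_BLOCKS` are made-up names, and the ranges are the Unicode block ranges for N'Ko (U+07C0 to U+07FF) and Adlam (U+1E900 to U+1E95F).

```python
from typing import Set

# Inclusive code point ranges of the two Unicode blocks under suspicion.
SUSPECT_BLOCKS = {
    "NKo": (0x07C0, 0x07FF),
    "Adlam": (0x1E900, 0x1E95F),
}


def suspect_scripts(text: str) -> Set[str]:
    """Return the names of the suspect blocks that occur at least once in text."""
    found: Set[str] = set()
    for ch in text:
        code_point = ord(ch)
        for name, (start, end) in SUSPECT_BLOCKS.items():
            if start <= code_point <= end:
                found.add(name)
    return found
```

Applied to the import script above, something like `df[df.text.map(lambda t: bool(suspect_scripts(t)))]` should list the affected rows. If importing only those rows reproduces the panic, that would support the guess; if other rows with these characters import fine, the cause is likely something else in this article.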