yoheinakajima / instagraph

Converts text input or URL into knowledge graph and displays
MIT License

Add Graph API #81

Closed gkorland closed 11 months ago

gkorland commented 11 months ago

Fix #80

coderabbitai[bot] commented 11 months ago

Walkthrough

The codebase has been updated to introduce a new abstract Driver class, designed to standardize interactions with graph databases. Subsequent changes in the Neo4j driver class align it with the new abstract methods. The main application file has been refactored to accommodate different types of graph databases, with a focus on extensibility and improved database interaction through a unified driver interface.
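As a rough illustration of the pattern the walkthrough describes, an abstract driver interface could look like the sketch below. The method names `store_graph` and `get_graph_data` and the `InMemoryDriver` class are hypothetical and chosen for illustration; they are not taken from `drivers/driver.py` in this PR.

```python
from abc import ABC, abstractmethod
from typing import Any

Node = dict[str, Any]
Edge = dict[str, Any]


class Driver(ABC):
    """Interface each graph database backend implements (names are illustrative)."""

    @abstractmethod
    def store_graph(self, nodes: list[Node], edges: list[Edge]) -> None:
        """Persist the nodes and edges of a generated knowledge graph."""

    @abstractmethod
    def get_graph_data(self) -> tuple[list[Node], list[Edge]]:
        """Return every stored node and edge."""


class InMemoryDriver(Driver):
    """Toy backend that only demonstrates the contract; not part of the PR."""

    def __init__(self) -> None:
        self._nodes: list[Node] = []
        self._edges: list[Edge] = []

    def store_graph(self, nodes: list[Node], edges: list[Edge]) -> None:
        self._nodes.extend(nodes)
        self._edges.extend(edges)

    def get_graph_data(self) -> tuple[list[Node], list[Edge]]:
        # Return copies so callers cannot mutate the stored lists.
        return list(self._nodes), list(self._edges)
```

Because `Driver` declares abstract methods, Python refuses to instantiate it directly, so a backend that forgets to implement one of them fails at construction time rather than at the first request.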

Changes

| File Path | Change Summary |
| --- | --- |
| drivers/driver.py | Introduced abstract Driver class with essential abstract methods. |
| drivers/neo4j.py | Updated Neo4j class to implement new abstract methods and added data processing functionality. |
| main.py | Refactored to use a generic driver and support multiple graph database types with a new --graph argument. |
| templates/index.html | Modified conditional check and error message related to graph data processing. |
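The `--graph` argument mentioned for `main.py` could be wired up roughly as follows. This is a sketch under assumptions: the `DRIVERS` registry, the `Neo4jStub` class, the `select_driver` helper, and the default port are hypothetical; only the `--graph` and `--port` flag names come from this thread.

```python
import argparse


class Neo4jStub:
    """Placeholder standing in for the real Neo4j driver class."""

    name = "neo4j"


# Hypothetical registry mapping a --graph value to a driver class.
DRIVERS = {"neo4j": Neo4jStub}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the app with an optional graph backend")
    parser.add_argument("--port", type=int, default=8080)  # default value is an assumption
    parser.add_argument("--graph", choices=sorted(DRIVERS), default=None)
    return parser


def select_driver(argv: list[str]):
    """Return an instance of the requested driver, or None when --graph is omitted."""
    args = build_parser().parse_args(argv)
    if args.graph is None:
        return None
    return DRIVERS[args.graph]()
```

With a registry like this, supporting another database means adding one entry to `DRIVERS` rather than adding another branch to the startup code.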

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Add support for other Graph Databases (Issue #80) | ✅ | |

🐇✨
In the code where graphs intertwine,
A new Driver class now does shine.
With methods so abstract, so fine,
It guides the data through the code-vine.
🌿🌟

gkorland commented 11 months ago

@snakajima @isamu please review the PR. I'm planning to add a follow-up PR that will include support for another graph database.

isamu commented 11 months ago

I tried to run it and got an error.

$ python main.py  --port 8081
Traceback (most recent call last):
  File "/Users/isamu/instagraph/main.py", line 14, in <module>
    from drivers.neo4j import Neo4j
  File "/Users/isamu/instagraph/drivers/neo4j.py", line 4, in <module>
    from driver import Driver
ModuleNotFoundError: No module named 'driver'
isamu commented 11 months ago
diff --git a/drivers/neo4j.py b/drivers/neo4j.py
index 819fce3..06c398d 100644
--- a/drivers/neo4j.py
+++ b/drivers/neo4j.py
@@ -1,7 +1,7 @@
 import os
 from typing import Any

-from driver import Driver
+from drivers.driver import Driver
 from neo4j import GraphDatabase

Then I got a different error:

 File "/Users/isamu/.pyenv/versions/3.10.3/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:
isamu commented 11 months ago

The latter is a problem with my settings. An error occurred on the main branch as well. However, the main branch runs without crashing.

Neo4j database: Cannot resolve address xxxxx
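One way to keep the app running when the database host cannot be resolved, as the main branch reportedly does, is to check name resolution before connecting and degrade gracefully on failure. The sketch below uses only the standard library; the `can_resolve` helper is hypothetical and is not code from either branch. It assumes the Bolt default port 7687 when the URI gives none.

```python
import socket
from urllib.parse import urlparse


def can_resolve(uri: str, default_port: int = 7687) -> bool:
    """Return True if the host in a database URI resolves to an address."""
    parsed = urlparse(uri)
    host = parsed.hostname
    if host is None:
        return False
    port = parsed.port or default_port
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        # Same exception type as in the traceback above.
        return False
    return True


if __name__ == "__main__":
    uri = "bolt://127.0.0.1:7687"
    if can_resolve(uri):
        print("host resolves; safe to create the driver")
    else:
        print("cannot resolve host; continuing without a graph database")
```

A check like this only covers DNS resolution; authentication or network errors would still need their own handling around the actual connection.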
isamu commented 11 months ago

I fixed the Neo4j settings and tried again, but it seems that I can't get data from the DB. When I then switched to the main branch, I could see the data there, so it seems the save was successful.

isamu commented 11 months ago

This condition causes an error.

https://github.com/yoheinakajima/instagraph/blob/main/templates/index.html#L265-L268

gkorland commented 11 months ago

@isamu thanks for the feedback. Very strange, I don't understand why I didn't get this error. I'll fix it.
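For context on the `ModuleNotFoundError`: when a script is started as `python main.py`, Python puts the script's directory on `sys.path`, so a module inside `drivers/` is importable as `drivers.driver` but not as bare `driver`. One possible reason the error did not appear for everyone is an environment where `drivers/` itself was on the path, though that is a guess. The sketch below reproduces the behaviour in a throwaway directory; the file layout mirrors the traceback, and everything else is illustrative.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def try_import(name: str) -> bool:
    """Return True if the module can be imported, False if it is not found."""
    try:
        importlib.import_module(name)
    except ModuleNotFoundError:
        return False
    return True


# Build a minimal project: <root>/drivers/driver.py
root = Path(tempfile.mkdtemp())
(root / "drivers").mkdir()
(root / "drivers" / "__init__.py").write_text("")
(root / "drivers" / "driver.py").write_text("class Driver:\n    pass\n")

# Simulate running a script located in <root>: only the root is on the path.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

bare_ok = try_import("driver")               # not visible from the project root
qualified_ok = try_import("drivers.driver")  # package-qualified import
print(bare_ok, qualified_ok)
```

Under these assumptions the bare import fails and the package-qualified one succeeds, which matches the one-line fix in the diff above.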

gkorland commented 11 months ago

@isamu I just tried it and it seems to work:

Here are the screenshots from both the Neo4j Browser and Instagraph:

[image] [image]

Looking at the logs, it also seems to work OK:

127.0.0.1 - - [18/Dec/2023 14:15:28] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [18/Dec/2023 14:15:29] "GET /favicon.ico HTTP/1.1" 404 -
127.0.0.1 - - [18/Dec/2023 14:15:30] "GET /get_graph_history HTTP/1.1" 200 -
web scrape done
starting openai call Help me understand following by describing as a detailed knowledge graph: September 19, 2023 The ability to scale out a database is crucial, in this short post I would like to walk through how FalkorDB scales out. As a quick recap I should mention that FalkorDB is a native graph database, developed as a Redis module, Falkor can manage thousands of individual graphs on a single instance. We start out with a single FalkorDB instance, let's call it primary, this instance handles both READ and WRITE operations. # create primary database docker run --name primary --rm -p 6379:6379 falkordb/falkordb As a next step we would like to isolate our reads queries from our writes, to do so we fire up a new FalkorDB instance, let's name it secondary and define it as a replica of primary. Once initial replication between the two servers is done we can divert all of our read queries to secondary and only hand off write queries to primary. # create replica docker run --name secondary --rm -p 6380:6380 falkordb/falkordb --port 6380 --replicaof 172.17.0.2 6379 It is worth mentioning that we're not limited to just a single READ replica, but we can create as many READ replicas as we need, e.g. a single primary and three read replicas: replica-1, replica-2 and replica-3. A load balancer can distribute the read load among these three replicas. In the former example we've distributed the entire dataset from the primary database to multiple replicas, in cases where multiple graphs are managed on a single server e.g. primary-1 holds graphs: G-1, G-2 and G-3.We can distribute the graphs among multiple servers, for example primary-1 would manage G-1 and a new server primary-2 would host G-2 and G-3. Write operations will be routed to the appropriate server depending on the accessed graph key.Of course each one of these primary servers can have multiple read replicas. e.g. 
primary-1 can have two read replicas and primary-2 will replicate its dataset to just a single replica. FalkorDB version 4 introduce a quick and efficient way of replicating queries between primary and its replicas. Up until recently a WRITE query (which ran to completion and modified a graph) would be replicated as-is to all replicas, causing each replica to re-run the query, although such a replication schema is simple and straightforward it entails a number of issues: 1. Replicated query might fail due to insufficient resources or timeout. 2. Using time related or random functions within a query risks ending up with data discrepancy. # Usage of time and randomnessMATCH (a), (b) WHERE a.create < time() -100 AND b.id = tointeger(100 * rand()) CREATE (a)-[:R]->(b) Although some WRITE queries are short and quick to execute e.g. CREATE (:Country {name:'Montenegro'}) Others might include a long and costly read portions e.g. MATCH (c:Country) WITH average(c.area / c.population) as avg_density MATCH (c:Country) WHERE c.area / c.population > avg_density SET c.crowded = true It would be a waste of time re-running write queries on the replicas, the primary DB had already done the hard work, it computed the "change-set" and so instead of sending the original query to its replicas the primary sends the query's "effects", an effect is a compact binary representation of a change e.g. connect node 5 to node 72 with a new edge, or update node 81 'score' attribute to the value 4. Replicating via effects solves the two problems we've mentioned earlier, in addition to saving the time spent computing what needs to be changed on the replicas. 
The benchmark tests three setups: single Primary Primary & Replica seperating reads from writes Primary & 2 Replicas scaling out reads Querying a graph with ~50M nodes and ~50M edges Creating the dataset; CREATE INDEX FOR (p:Person) ON (p.id) UNWIND range(0, 1000000) AS x CREATE (p:Person {id:x}) MATCH (p:Person) UNWIND range(0, toInteger(rand() * 100)) AS x CREATE (p)-[:CONNECTED]->(:Z) READ query: MATCH (p:Person {id:$id})-[]->() return count(1) WRITE query: MATCH (a:Person {id:$a_id}) CREATE (a)-[:CONNECTED]->(:Z)
Results from Graph: (EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x7f7579d97bd0>, keys=[]), EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x7f7579da0810>, keys=[]))
127.0.0.1 - - [18/Dec/2023 14:16:34] "POST /get_response_data HTTP/1.1" 200 -
127.0.0.1 - - [18/Dec/2023 14:16:35] "POST /get_graph_data HTTP/1.1" 200 -
isamu commented 11 months ago

Graph is working, but history is not working.

[image: Screenshot 2023-12-19 5 06 37]
gkorland commented 11 months ago

@isamu fixed (I wasn't handling the response correctly)
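The log earlier in the thread prints `EagerResult(records=[], summary=..., keys=[])` objects, which are named tuples with the fields `records`, `summary`, and `keys`. The actual fix is not shown in this thread, so the following is only a sketch of one way to turn such a result into plain dictionaries. `EagerResultStandIn` and `rows_from_result` are hypothetical names, and the stand-in type is used so the example runs without a database.

```python
from typing import Any, NamedTuple


class EagerResultStandIn(NamedTuple):
    """Stand-in with the same three fields as the result shown in the log."""

    records: list
    summary: Any
    keys: list


def rows_from_result(result) -> list[dict]:
    """Unpack a (records, summary, keys) result into a list of plain dicts."""
    records, _summary, keys = result
    return [{key: record[key] for key in keys} for record in records]


if __name__ == "__main__":
    result = EagerResultStandIn(
        records=[{"id": 1, "label": "FalkorDB"}],
        summary=None,
        keys=["id", "label"],
    )
    print(rows_from_result(result))
```

Treating the result as the whole tuple instead of unpacking `records` is an easy mistake to make, and an empty `records` list then looks like a successful but empty response.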

isamu commented 11 months ago

@gkorland Thank you for your patience. I think it's a very good PR. I'll merge this!

gkorland commented 11 months ago

🙏 Next, I'll push a PR that adds support for another graph database.

isamu commented 11 months ago

I'm looking forward to your graph database!