dataprocess question?

Duperr commented 10 months ago

I'd like to ask you some questions. You are using four CSV files for Graph_generation image generation, so what is the purpose of preprocessing the pcap files in the first place? What is the relationship between the preprocessing step and the generation of 'edge.csv', 'graphid2label.csv', 'node2graphID.csv', and 'nodeattrs.csv'? Thanks for your answer.

zuluokonkwo commented 10 months ago

Your question isn't clear but I'll try to explain based on my understanding.

This paper has nothing to do with image generation; the outputs of the processing stage are graphs (each graph represents a network session). A graph is stored in a computer with its adjacency information, which can be accessed as either a matrix or a list. In the paper, network packets are modeled as graph nodes, edges represent the chronological relationship between packets, and node attributes are the packet information in raw bytes.
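As a rough illustration of that representation, here is a minimal sketch that turns one session into a graph. It assumes that each packet is linked to the next one in arrival order (a simple chain); the exact edge rule used in the paper may differ, and the function name `session_to_graph` is hypothetical.

```python
def session_to_graph(packets):
    """Build a graph for one session.

    packets: list of bytes objects, already sorted by arrival time.
    Assumption: consecutive packets are joined by a directed edge.
    """
    n = len(packets)

    # Node attributes: the raw bytes of each packet as a list of ints (0-255).
    node_attrs = [list(p) for p in packets]

    # Chronological edges: packet i -> packet i + 1.
    edges = [(i, i + 1) for i in range(n - 1)]

    # Adjacency list view.
    adj_list = {i: [] for i in range(n)}
    for src, dst in edges:
        adj_list[src].append(dst)

    # Adjacency matrix view of the same edges.
    adj_matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        adj_matrix[src][dst] = 1

    return node_attrs, adj_list, adj_matrix


attrs, adj_list, adj_matrix = session_to_graph([b"\x01\x02", b"\x03", b"\x04\x05"])
```

The list and the matrix hold the same adjacency information; which one is more convenient depends on the graph library used downstream.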

Pre-processing the pcap file entails masking IP address information, padding UDP headers, etc., to ensure information consistency (please see the paper). The four CSVs are for graph generation and labeling (this can be done differently if you choose). These files are what you need to generate the graphs.
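To give a feel for those two steps, here is a minimal sketch that works directly on the raw bytes of an IPv4 packet (link-layer header already removed). It is not the repository's code: the function names are hypothetical, and padding the 8-byte UDP header to 20 bytes (the minimum TCP header length) is an assumption, so check the paper for the exact scheme.

```python
UDP_PROTOCOL = 17      # IPv4 protocol number for UDP
UDP_HEADER_LEN = 8     # a UDP header is always 8 bytes


def mask_ip_addresses(ip_packet: bytes) -> bytes:
    """Zero the source and destination addresses (bytes 12-19 of the IPv4 header)."""
    buf = bytearray(ip_packet)
    buf[12:20] = bytes(8)
    return bytes(buf)


def pad_udp_header(ip_packet: bytes, target_len: int = 20) -> bytes:
    """Insert zero bytes after the UDP header so it spans target_len bytes.

    Assumption: target_len = 20. Length and checksum fields are left
    untouched, since the result is used as model input, not re-sent.
    """
    ihl = (ip_packet[0] & 0x0F) * 4          # IPv4 header length in bytes
    if ip_packet[9] != UDP_PROTOCOL:         # byte 9 holds the protocol field
        return ip_packet
    udp_end = ihl + UDP_HEADER_LEN
    padding = bytes(target_len - UDP_HEADER_LEN)
    return ip_packet[:udp_end] + padding + ip_packet[udp_end:]
```

After both steps, TCP and UDP packets carry no addresses and have transport headers of the same length, which is the kind of consistency the preprocessing is after.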

I'm guessing your question is "why use the CSV files to generate graphs? Why not go from pcap to graphs directly?" It's much easier and more straightforward to perform analysis on CSV files using Python. It's also easy to fix errors during data processing when the files are in CSV format. You can try to do the same on pcap files with some Python libraries, but it might not be as straightforward.

To summarize: first, we have the PCAP files, which we process to get the master CSV file. The master CSV file is processed to get the four CSV files. Then the four CSV files are processed to give us our graphs.
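The middle step of that pipeline can be sketched as follows. The four file names come from this thread, but everything else is an assumption made for illustration: the master column names (`session_id`, `label`, `payload_hex`), the layout of each output table, the chain-style edges, and the function name `split_master`. The real scripts in the repository may organize this differently.

```python
import csv
import io


def split_master(master_rows):
    """Split master rows (one per packet, sorted by time) into four tables.

    Hypothetical column names: session_id, label, payload_hex.
    """
    edges, graph_labels, node_graph, node_attrs = [], [], [], []
    graph_ids = {}   # session_id -> graph id
    last_node = {}   # graph id -> most recent node id in that graph

    for node_id, row in enumerate(master_rows):
        session = row["session_id"]
        if session not in graph_ids:
            graph_ids[session] = len(graph_ids)
            graph_labels.append((graph_ids[session], row["label"]))
        gid = graph_ids[session]

        node_graph.append((node_id, gid))
        node_attrs.append((node_id, row["payload_hex"]))

        # Link this packet to the previous packet of the same session.
        if gid in last_node:
            edges.append((last_node[gid], node_id))
        last_node[gid] = node_id

    return {
        "edge.csv": edges,
        "graphid2label.csv": graph_labels,
        "node2graphID.csv": node_graph,
        "nodeattrs.csv": node_attrs,
    }


# Tiny in-memory stand-in for the master CSV file.
MASTER_CSV = """session_id,label,payload_hex
a,chat,0102
b,video,0a0b
a,chat,0304
"""
tables = split_master(list(csv.DictReader(io.StringIO(MASTER_CSV))))
```

Each table can then be written out with `csv.writer`, and the graph-building step only needs to read those four files back in.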

I hope this answers your question.

jason2xx11 commented 8 months ago

Thanks for your clear reply. By the way, could you give some pointers on "The master csv file is processed to get the four csv files"?

zuluokonkwo / Encrypted-Network-Traffic-Classification-with-Higher-Order-Graph-Neural-Network