zhongkaifu / RNNSharp

RNNSharp is a deep recurrent neural network toolkit that is widely used for many kinds of tasks, such as sequence labeling and sequence-to-sequence. It is written in C# and targets .NET Framework 4.6 or above. RNNSharp supports many types of networks, such as forward, bi-directional, and sequence-to-sequence networks, and many types of layers, such as LSTM, Softmax, sampled Softmax and others.
BSD 3-Clause "New" or "Revised" License

Help with Converting Spatio-Temporal Dataset for Consumption #19

Open trecius opened 8 years ago

trecius commented 8 years ago

Hello,

I have a spatio-temporal dataset that I have compiled. It's in a TSV format, and I'd like your RNNSharp to consume the input for classification as well as recognition. My features are continuous values in the range [0, 1]. My TSV file looks like the following:

ID1 0.923 0.223 0.573 0.235 0.111
ID1 0.920 0.228 0.353 0.213 0.098
ID1 0.901 0.677 0.235 0.551 0.121
...
ID1 0.853 0.383 0.301 0.618 0.132

ID1 0.918 0.733 0.622 0.222 0.238
ID1 0.985 0.682 0.793 0.221 0.465
...
ID1 0.953 0.788 0.912 0.228 0.539

ID2 0.918 0.733 0.622 0.222 0.238
ID2 0.985 0.682 0.793 0.221 0.465
...
ID2 0.953 0.788 0.912 0.228 0.539

Each line in my TSV is a snapshot at a specific moment in time. When all snapshots are combined, they describe the spatio-temporal entity. These entities are separated by an EMPTY LINE. Therefore, the first instance of ID1 is all the lines until you reach the empty line. The second instance of ID1 is the next set of contiguous lines, and so on. Note that the first TSV value is just a class label and is not a feature. Also, I have 6 class labels for this spatio-temporal dataset.

1.) First, how can I transform my data into an "embedded feature" that is in the correct model format? I assume this is the Txt2Vec?

2.) Additionally, I will have to create a corpus. Will the following work for the corpus?

ID1 ClassLabel1
ID2 ClassLabel2
ID3 ClassLabel3
ID4 ClassLabel4
ID5 ClassLabel5
ID6 ClassLabel6

3.) Additional steps or a walkthrough would be greatly appreciated.

I hope this information helps all others who are trying to consume RNNSharp. When I finish, I hope to compile a walkthrough for others, so they can easily consume this great technology.

Thank you.
zhongkaifu commented 8 years ago

For each time frame (one line in your training corpus), if it contains only 5 features, you could build the embedding model as follows, so that each time frame has its own unique ID:

ID1 0.923 0.223 0.573 0.235 0.111
ID2 0.920 0.228 0.353 0.213 0.098
ID3 0.901 0.677 0.235 0.551 0.121
...
ID2 0.920 0.228 0.353 0.213 0.098

I just updated RNNSharp to support embedding models in raw text format, so you can use the format above for training directly. Please replace WORDEMBEDDING_FILENAME with WORDEMBEDDING_RAW_FILENAME in the configuration file.

For #2, yes, it looks good. For example, it may look like:

ID1 Wave
ID2 Label2
ID2 Wave
...
IDn LabelX

Each time frame has a corresponding label as its result.
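The conversion described above can be sketched in Python. This is only an illustration under stated assumptions, not part of RNNSharp: the function name `convert`, the `F1`, `F2`, ... frame-ID scheme, and the use of tabs as the output separator are all hypothetical. It gives every time frame a unique ID, writes the features under that ID for the raw embedding file, and writes the frame ID with its class label for the training corpus, keeping a blank line between entities.

```python
def convert(tsv_lines):
    """Split labelled snapshot rows into (embedding_lines, corpus_lines).

    Each non-blank input row is "<label> <f1> ... <fn>" (tabs or spaces);
    a blank row marks the end of one spatio-temporal entity.
    """
    embedding_lines = []
    corpus_lines = []
    frame_no = 0
    for raw in tsv_lines:
        row = raw.strip()
        if not row:
            # Keep one blank line in the corpus as the entity separator.
            if corpus_lines and corpus_lines[-1] != "":
                corpus_lines.append("")
            continue
        label, *features = row.split()
        frame_no += 1
        frame_id = f"F{frame_no}"  # hypothetical unique ID per time frame
        # Embedding file: key followed by the dense feature values.
        embedding_lines.append("\t".join([frame_id, *features]))
        # Training corpus: frame key followed by its class label.
        corpus_lines.append(f"{frame_id}\t{label}")
    return embedding_lines, corpus_lines


sample = [
    "ID1\t0.923\t0.223\t0.573\t0.235\t0.111",
    "ID1\t0.920\t0.228\t0.353\t0.213\t0.098",
    "",
    "ID2\t0.918\t0.733\t0.622\t0.222\t0.238",
]
embedding, corpus = convert(sample)
```

With this sample, `embedding` holds three keyed feature rows and `corpus` holds the frame/label pairs with one blank line between the two entities. Each list can then be written out with `"\n".join(...)`.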

trecius commented 8 years ago

Hello:

I'm getting closer. I've since extracted all the time frames that I want to train on into a single file, rawModel.txt. It has the format:

\t\t\t\t\t
\t\t\t\t\t
...
\t\t\t\t\t

I've also created a train.txt file, and it is in the format:

\t
\t
\t
...
\t

Finally, I've also created a template.txt file. It looks like this:

U01:%x[0,0]
U02:%x[0,1]
U03:%x[0,2]
U04:%x[0,3]
U05:%x[0,4]
U06:%x[-1,0]
U07:%x[-1,1]
U08:%x[-1,2]
U09:%x[-1,3]
U10:%x[-1,4]
U11:%x[1,0]
U12:%x[1,1]
U13:%x[1,2]
U14:%x[1,3]
U15:%x[1,4]

I've modified the BAT file to use the new files, but it's not working the way I had planned.

1.) How does RNNSharp (RNNSharpConsole) know when one spatio-temporal entity has completed and a new one begins? I'm mostly asking about the edge cases. I've tried to split them up using a blank line, but an exception is thrown, stating the lengths are not the same.
zhongkaifu commented 8 years ago

Since you are going to use continuous values as features, template.txt should keep only one line: U01:%x[0,0]. All the other lines are used for discrete features only.

In the training corpus, RNNSharp uses a blank line to separate two entities, but the embedding model (rawModel.txt in your example) doesn't need blank lines. The embedding model is just a set of key-value pairs: RNNSharp looks up the embedding model by keyword and gets the dense features from it for encoding or decoding.
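Before running training, the two files can be checked against that description. The Python sketch below is a hypothetical pre-flight check, not RNNSharp's own validation, and the thread does not establish that these conditions caused the "lengths are not the same" exception. It assumes whitespace-separated columns, reports blank lines in the embedding file, corpus keys missing from the embedding file, and corpus rows with differing column counts.

```python
def check_corpus(corpus_lines, embedding_lines):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []

    # The embedding file is a plain key -> values table with no separators.
    keys = set()
    for n, line in enumerate(embedding_lines, start=1):
        if not line.strip():
            problems.append(f"embedding line {n}: blank line")
            continue
        keys.add(line.split()[0])

    # In the corpus, blank lines separate entities and are skipped here.
    widths = set()
    for n, line in enumerate(corpus_lines, start=1):
        if not line.strip():
            continue
        cols = line.split()
        widths.add(len(cols))
        if cols[0] not in keys:
            problems.append(
                f"corpus line {n}: key {cols[0]!r} not in embedding file"
            )

    if len(widths) > 1:
        problems.append(
            f"corpus rows have differing column counts: {sorted(widths)}"
        )
    return problems


embedding = ["F1\t0.1\t0.2", "F2\t0.3\t0.4"]
good = check_corpus(["F1\tA", "", "F2\tB"], embedding)
bad = check_corpus(["F1\tA", "", "F2\tB", "F3"], embedding)
```

Here `good` is empty, while `bad` reports the unknown key `F3` and the row that is missing its label column.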

RNNSharp already supports embedding models in raw text format; you can sync the latest code from the depot and use it. In your case, the configuration file looks like:

#The file name for template feature set
TFEATURE_FILENAME: tfeature

#The context range for template feature set. In below, the context is current token, next token and next after next token
TFEATURE_CONTEXT: 0

WORDEMBEDDING_RAW_FILENAME: rawModel.txt

#The context range for word embedding.
WORDEMBEDDING_CONTEXT: -1, 0, 1

#The column index applied word embedding feature
WORDEMBEDDING_COLUMN: 0
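One plausible reading of WORDEMBEDDING_CONTEXT: -1, 0, 1 is that each time frame's dense input is built from the embeddings of the previous, current, and next frames within the same entity. The Python sketch below illustrates that reading only. The function `context_features` is hypothetical, and both the concatenation order and the zero padding at the first and last frame are assumptions that the thread does not confirm.

```python
def context_features(frame_ids, embeddings, context=(-1, 0, 1)):
    """Concatenate the embeddings at the given offsets for each frame.

    frame_ids  -- keys of one entity's frames, in time order
    embeddings -- dict mapping key -> list of floats (the raw model)
    """
    dim = len(next(iter(embeddings.values())))
    rows = []
    for i in range(len(frame_ids)):
        row = []
        for offset in context:
            j = i + offset
            if 0 <= j < len(frame_ids):
                row.extend(embeddings[frame_ids[j]])
            else:
                # Assumption: positions outside the entity are zero-padded.
                row.extend([0.0] * dim)
        rows.append(row)
    return rows


embeddings = {"F1": [0.1, 0.2], "F2": [0.3, 0.4]}
features = context_features(["F1", "F2"], embeddings)
```

Under these assumptions, 5 features per frame and a context of three offsets would give 15 dense inputs per time frame.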

I hope this information helps. As for the exception you mentioned, could you please share more details about it?