yanpanlau / DDPG-Keras-Torcs

Using Keras and Deep Deterministic Policy Gradient to play TORCS

Has anyone tried using image input? #11

Closed sufengniu closed 5 years ago

sufengniu commented 7 years ago

Hello,

Has anyone tried using images as input to train the network? I have worked on this for a couple of days, using a 3-layer conv net to process the image in place of the original low-dimensional states, but it doesn't work properly.

saiprabhakar commented 7 years ago

Are you sure you are getting the correct inputs? Check #4.

sufengniu commented 7 years ago

@saiprabhakar Yes, it is correct. I made some modifications to the code beforehand, and it outputs gray images. I have also tried using a pre-trained actor (trained on the low-dimensional states) as a supervisor to train the image-input actor, which worked pretty well. The critic could also generate plausible Q-values. But when I turned off the supervised learning and used RL, the car just keeps turning left.

This is how I build my critic model:

from keras.layers import Dense, Flatten, Input, merge
from keras.layers.convolutional import Convolution2D
from keras.models import Model
from keras.optimizers import Adam

def create_critic_network(self, state_size, image_size, action_dim):
    print("Now we build the CNN model")
    # Image branch: three conv layers, flattened into dense features
    I = Input(shape=image_size)
    I0 = Convolution2D(64, 5, 5, subsample=(3, 3), activation='relu',
        init='uniform', border_mode='same')(I)
    I1 = Convolution2D(64, 4, 4, subsample=(2, 2), activation='linear',
        init='uniform', border_mode='same')(I0)
    I2 = Convolution2D(64, 3, 3, subsample=(1, 1), activation='relu',
        init='uniform', border_mode='same')(I1)
    I2_5 = Flatten()(I2)
    I3 = Dense(512, activation='linear', init='uniform')(I2_5)
    I4 = Dense(HIDDEN2_UNITS, activation='relu')(I3)
    # Action branch, merged with the image features
    A = Input(shape=[action_dim])
    a1 = Dense(HIDDEN2_UNITS, activation='linear')(A)
    h2 = merge([a1, I4], mode='concat')
    h3 = Dense(HIDDEN2_UNITS, activation='relu')(h2)
    V = Dense(action_dim, activation='linear')(h3)
    model = Model(input=[A, I], output=V)
    adam = Adam(lr=self.LEARNING_RATE)
    model.compile(loss='mse', optimizer=adam)
    return model, A, I

saiprabhakar commented 7 years ago

Are you saying that you used two actors, one using the low-dimensional states (already trained) and another using images, and you used the first net to train the second? So you trained the actor and critic separately?

For me, training with both the image and the low-dim states together sometimes takes ~300 episodes. I haven't trained the network with images alone. I don't think a single frame will work very well, since it doesn't carry enough information. But since you say it keeps turning left, I think more training would help.

sufengniu commented 7 years ago

@saiprabhakar Thanks for your reply. Yes, you are right. The reason I use the first net as supervision is that pure reinforcement learning with image input is not working.

I am actually using three networks: a pre-trained actor (I call it the guide), an actor, and a critic. In the supervised phase, the critic is trained exactly as in DDPG. For the actor, instead of using the gradient from the critic, I use the action generated by the guide as the label and minimize the mean squared error between the guide and the actor.
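
In pseudo-Keras, that supervised step looks roughly like this (guide_actor, image_actor, replay_buffer, and the batch variables are all placeholders, not code from this repo):

    from keras.optimizers import Adam

    # guide_actor: pre-trained on low-dimensional states, kept frozen
    # image_actor: takes stacked frames, same action layout as the guide
    image_actor.compile(loss='mse', optimizer=Adam(lr=1e-4))

    for step in range(num_supervised_steps):
        low_dim_batch, image_batch = replay_buffer.sample(batch_size)   # paired observations
        target_actions = guide_actor.predict(low_dim_batch)             # guide's actions as labels
        image_actor.train_on_batch(image_batch, target_actions)         # MSE between guide and actor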

I have also tried concatenating the image features (convolved, then flattened) with the low-dimensional sensor data, and it is still not working. I was wondering how you designed your convolution layers for processing the image, and whether you changed any other parts of the code.

saiprabhakar commented 7 years ago

I didn't do anything special for the image; like you mention, I convolved and flattened it. How many training episodes did you run?

Which action input are you giving to the critic, the guide's or the other actor's?

Just to be clear: when you said you turned off the supervised training, did you mean you started the training from scratch without using the guide actor, or just turned off the guide and used the critic's gradients? If it's the second case, there may be some destabilization going on (I am not familiar with the pre-trained actor approach).

The guide actor is an interesting idea. Is there any literature on it?

sufengniu commented 7 years ago

Thank you. That is weird for our network. We stack 4 gray frames as 4 channels. We trained for 2000 episodes and it still acts weird, especially the steering (it stays at 1 forever). Did you also set up the image that way? I would appreciate it if you could share the network configuration parameters.

I mean I start training with supervised learning, then turn off the guide and use the critic's gradients. I give all three actions (brake, steer, throttle) from the second actor (not the guide) to the critic. The idea is from Ruslan Salakhutdinov's paper on knowledge transfer: https://arxiv.org/pdf/1511.06342v4.pdf

yanpanlau commented 7 years ago

Did you change TORCS to a 64x64 image size? gym_torcs only supports the 64x64-pixel mode.


sufengniu commented 7 years ago

Yes, I have extracted and visualized the image in Python, and it is a 64 by 64 image. I found a comment online that said: "DDPG is extremely sensitive to bad hyperparameters. I haven't tried on TORCS, but for other control problems I have found reasonably good results by precisely following the recipe in the appendix of the DDPG paper. In my experience even small details (like initializing weights with a uniform distribution rather than a Gaussian of the same scale) make a difference." My guess is that the hyperparameter tuning is the trick.
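
For reference, the DDPG paper's appendix initializes the final actor/critic layers from a uniform distribution in [-3e-3, 3e-3]; a rough Keras 1.x sketch of that detail (the layer sizes here are arbitrary):

    from keras.initializations import uniform
    from keras.layers import Dense, Input
    from keras.models import Model

    # Final-layer init from the DDPG appendix: U(-3e-3, 3e-3), so initial outputs
    # stay near zero and the tanh/linear output units do not saturate.
    final_init = lambda shape, name=None: uniform(shape, scale=3e-3, name=name)

    state = Input(shape=[29])                        # placeholder state size
    h = Dense(300, activation='relu')(state)
    steer = Dense(1, activation='tanh', init=final_init)(h)
    actor = Model(input=state, output=steer)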

stepjam commented 7 years ago

@sufengniu I have the same problem as you. I execute actions 4 times and collect each resulting image before stacking them and sending them into the CNN. The action values quickly reach their maximum/minimum value and result in sharp turning. I have tried playing with the learning rate, but the final outcome is always the same :(

ghliu commented 7 years ago

@stepjam @sufengniu Hi, I was able to get the DDPG agent working using only 4 sequential images as its observation; the performance is comparable to @yanpanlau's original implementation using laser+physical states. Some implementation details for your reference:

  1. The image is normalized to [0.0, 1.0], and I am using RGB images.
  2. Instead of executing the action 4 times and collecting the images, I keep the previous 3 images and concatenate the latest one to form the output observation (so the dimension is 64x64x12; see the sketch after this list).
  3. I use three 2D convolution layers with ReLU before plugging into the dense layer. Doing so greatly reduces the dimension of the state and makes training feasible. Also, each conv layer is followed by a batch normalization layer.
  4. For the E-Road environment, the agent starts performing a reasonable policy around episode 200~300.
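
A rough sketch of that rolling frame stack (the function and buffer names are placeholders, assuming 64x64 RGB frames):

    import numpy as np
    from collections import deque

    FRAME_HISTORY = 4                    # current frame + 3 previous ones
    frames = deque(maxlen=FRAME_HISTORY)

    def stack_observation(new_rgb_frame):
        """Append the latest 64x64x3 frame and return a 64x64x12 observation."""
        frame = new_rgb_frame.astype('float32') / 255.0     # normalize to [0, 1]
        if len(frames) == 0:                                 # first step: pad with copies
            for _ in range(FRAME_HISTORY):
                frames.append(frame)
        else:
            frames.append(frame)
        return np.concatenate(list(frames), axis=-1)         # channels: 4 * 3 = 12
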
sufengniu commented 7 years ago

@ghliu Thank you very much for the info. I will try it and post the results later if I have any news.

stepjam commented 7 years ago

@ghliu Thank you also. Was this using Batch Normalization?

ghliu commented 7 years ago

@stepjam Yeah, you are right. I forgot to mention :P Just updated the previous comments.

pavitrakumar78 commented 7 years ago

@ghliu Thank you for your suggestions! I am also trying to make this work using images as input. I have a few questions about your setup:

  1. Does laser+physical mean you are using both vision and car-related params such as angle, speed, etc. to train your model?
  2. Which network did you use? If it's 3 layers, I assume the nb_filter values are 32, 64, 64 with normalization layers in between?
  3. If you are concatenating the frames, the 64x64x12 tensor is the input to the CNN for predicting the move to be played at that (4th) frame, right?

@sufengniu Hi! Have you had any success in training a vision-based DDPG model?

stepjam commented 7 years ago

@pavitrakumar78 -- I have tried many variations, and my thinking is that the hyper-parameters are super sensitive. Unless you have them within this sweet spot, you just end up with garbage.

Edit: Just to clarify, I'm talking about the algorithm itself, not the implementation.

pavitrakumar78 commented 7 years ago

@stepjam Hi! Thanks for your input! Yes, it seems that way in the tests I have been doing, though what I've been working on is not exactly DDPG, just something close to it. I am still trying to figure out what works best! :)

sufengniu commented 7 years ago

@pavitrakumar78 I agree with @stepjam; I think both DDPG and actor-critic algorithms are very sensitive to hyper-parameter settings. I have had the same experience with DDPG in other environments. It might be necessary to explore how to set good initialization and hyperparameters for TORCS.

By the way, mine is also not working

ghliu commented 7 years ago

@pavitrakumar78 , to answer your questions

  1. physical + laser is just my rough classification of the original implementation in this repo, where the physical states include 10 values such as vehicle velocity, plus 19 values from the simulated laser beams (called track in gym_torcs.py). It doesn't include vision.
  2. I am using 16, 32, 32 as nb_filter, and yes, BN layers in between.
  3. Yes. The state includes previous three frames and the current frame.

Most of the hyper-parameters remain the same; only the minibatch size is reduced to 16 instead of 32, as suggested by the original DDPG paper. I am using Keras 1.1.0. I noticed there are some differences in the BN layer implementation across Keras versions; at least the parameters printed by model.summary() differ from Keras 1.2.0 with the same code. In my experience, both training and testing work fine here, and I recently did more experiments to verify it again. (The same vision input also works with other algorithms such as CDQN (NAF).) The BN implementation is currently my only suspicion; other than that I have no clues yet.

I am planning to release codes, network weights, and probably some videos later.

pavitrakumar78 commented 7 years ago

@ghliu Hi! Thanks for your answers! Currently I am testing these networks: Actor:

        # Keras 1.x imports shared by the actor and critic snippets below
        from keras.layers import Input, Dense, Flatten, Lambda, merge
        from keras.layers.convolutional import Convolution2D
        from keras.layers.normalization import BatchNormalization
        from keras.models import Model

        S = Input(shape=state_size)                       # e.g. (64, 64, 12) stacked frames
        S_in = Lambda(lambda img: img / 255.0)(S)         # scale pixels to [0, 1]
        conv1 = Convolution2D(16, nb_row=8, nb_col=8, subsample=(4, 4), activation='relu')(S_in)
        batch_norm1 = BatchNormalization()(conv1)
        conv2 = Convolution2D(32, nb_row=4, nb_col=4, subsample=(2, 2), activation='relu')(batch_norm1)
        batch_norm2 = BatchNormalization()(conv2)
        conv3 = Convolution2D(32, nb_row=4, nb_col=4, subsample=(2, 2), activation='relu')(batch_norm2)
        batch_norm3 = BatchNormalization()(conv3)
        flat = Flatten()(batch_norm3)
        den = Dense(300, activation='relu')(flat)         # actor feature head

and Critic:

        S = Input(shape=state_size)
        S_in = Lambda(lambda a: a / 255.0)(S)             # scale pixels to [0, 1]
        conv1 = Convolution2D(16, nb_row=8, nb_col=8, subsample=(4, 4), activation='relu')(S_in)
        batch_norm1 = BatchNormalization()(conv1)
        conv2 = Convolution2D(32, nb_row=4, nb_col=4, subsample=(2, 2), activation='relu')(batch_norm1)
        batch_norm2 = BatchNormalization()(conv2)
        conv3 = Convolution2D(32, nb_row=4, nb_col=4, subsample=(2, 2), activation='relu')(batch_norm2)
        batch_norm3 = BatchNormalization()(conv3)
        flat = Flatten()(batch_norm3)
        h1 = Dense(300, activation='relu')(flat)          # state features
        A = Input(shape=[action_dim], name='action2')     # action branch
        a1 = Dense(300, activation='linear')(A)
        h2 = merge([h1, a1], mode='sum')                  # combine state and action features
        h3 = Dense(HIDDEN2_UNITS, activation='relu')(h2)
        V = Dense(action_dim, activation='linear')(h3)
        model = Model(input=[S, A], output=V)

I am trying 2 models: one with only steering as output and another with all 3 outputs (steering, brake, acceleration). So far, in a sample run with all 3, I observed that the steering was largely positive and very close to zero (10^-5), and acceleration and brake seemed to be even closer to zero, so the car just stays in one spot. I am now running some experiments with only steering as output. For these experiments, I am using the same base code as this repo, with minor changes to the network input dims.

@ghliu In your tests, did you use the same gym_torcs.py file as this repo? The code in this repo has some changes regarding what is considered a terminal state and some minor changes to how the reward is calculated.

ghliu commented 7 years ago

@pavitrakumar78 I also experienced similar behavior when I tried to train TORCS with CDQN (NAF) on the original 29-state input. In my case, I guessed it had converged to a trivial policy where the agent tried to stay in the middle of the track as much as possible, but with an extremely slow velocity. And surprisingly, this policy actually gave a higher return, since the episode never terminates... This is the issue of how to design the reward and termination condition "right" :/ I haven't checked whether the same situation happens with DDPG.

I am using the same gym_torcs.py as this repo, but modified quite a lot for other purposes. The reward function remains the same, which is slightly different from the original paper. I terminate the episode if the car runs backward (and give it a large negative reward), and if the car is out of the track with no progress and negative speedX.
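
Roughly, that termination logic looks like this (the observation fields follow gym_torcs; the thresholds and the progress variable are placeholders):

    import numpy as np

    def check_termination(obs, progress, reward):
        """Terminate on backward driving, or off-track with no progress and negative speedX."""
        done = False
        if np.cos(obs.angle) < 0:                 # car is facing backward along the track
            reward = -200.0                       # large negative reward (value is a guess)
            done = True
        if abs(obs.trackPos) > 1.0 and progress <= 0 and obs.speedX < 0:
            done = True                           # off track, no progress, rolling backward
        return reward, done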

pavitrakumar78 commented 7 years ago

@ghliu Thanks for your reply! Yes, I noticed that behaviour also: the agent seems to stay on the road and seems to have somehow figured out steering, but acceleration decreases and brake gradually increases, so it actually doesn't move. I will work on the reward and termination conditions and try a few more experiments!

dongleecsu commented 7 years ago

@pavitrakumar78 Hi, any good news from your vision experiments (1 output or 3 outputs)? I've tried many hyper-parameter settings, both for 1 output (steer) and 3 outputs (steer, accel., brake). When the output is just steer and the network starts to train, the output quickly reaches its maximum/minimum value, and after training for 1000 epochs the output is still at the maximum/minimum value.

You say above that the agent seems to have figured out steering: did you encounter the same situation during training? Many thanks.

pavitrakumar78 commented 7 years ago

@dongleecsu Hi, yes, I observed the described scenario while training. But as I said, even though it managed to stay on the road, the acceleration and brake parameters were not learned properly. Unfortunately, I haven't had time to run training again with only steering as output, because I am working on another topic now. If the output quickly reaches the max value, i.e. always goes left or always goes right, then it is actually not learning. You might have to try playing around with the network configuration or try modelling the rewards in a different way.

arsenious commented 7 years ago

@ghliu

Hi, I think the original code in this repo doesn't take into account the past frames. Have you already implemented that in your code in your repo?

Thanks

Sophistt commented 6 years ago

@ghliu Hi, I tried to train my model using only images, but I found that the runtime increased from 0.2s to 0.4s per cycle (from choosing one action to choosing the next). Is this reasonable? How long was a cycle when you trained your model?

AbdullahMohamed55 commented 6 years ago

@Sophistt Did the model train and produce reasonable output?

XiaoZzai commented 6 years ago

@ghliu Did you train your model with images as input from scratch? I failed to do so; the agent ends up turning sharply to the right.

damienlancry commented 5 years ago

Hi everyone, I am also trying to design an RL agent able to learn a good policy from only pixels in the TORCS domain. I am following the experiment details from the DeepMind paper "Continuous Control with Deep Reinforcement Learning". In that paper they say that some replicas were able to learn good policies on the TORCS environment from pixels only, but other replicas failed to learn sensible policies, so this seems to be a rather difficult problem. They also mention that they used 3 convolutional layers with 32 filters at each layer, but they do not mention the kernel sizes and strides. Any ideas about that? At the moment I am using kernel sizes of 4 and strides of 1 at each layer.

They also mention that they altered the OU noise in this environment "because of the different timescales involved". I did not really understand this sentence: does it mean they modified the noise, made it smaller, or does it mean they simply removed it (which seems unlikely, because we need exploration)? At the moment I am using the OU noise implementation from OpenAI baselines with mu = (0,0,0), sigma = (0.2,0.2,0.2), and theta = (0.15,0.15,0.15), but I admit that my understanding of these parameters is limited. In my implementation the actor outputs a continuous action in [-1,1]^3, which I then rescale to [0,1] for acceleration and brake; that is why I think mu should be (0,0,0), so that the noise eventually vanishes.

They also used a reward equal to "the velocity of the car projected along the direction of the track and a negative reward of -1 in case of collision", which mathematically translates to r = Vx * cos(theta), and I chose to stick to this reward function for the moment. But I am wondering whether the reward needs clipping or rescaling, as they did in "Human-level control through deep RL" (the DQN paper). At the moment my agent immediately learns to steer left, accelerate, and not brake (steering = 1, acceleration = 1, brake = 0) and never stops doing it. I reset the environment every time the car goes out of the track.

As inputs I use 64x64x9 tensors, i.e. 3 RGB images stacked along the 3rd axis, as in the original paper. In my implementation every timestep takes 0.05 seconds to run, so I am assuming there is an action repeat of 3, considering that an action is simulated every 0.02 seconds (50 Hz) and, until the server receives a new action from the client, it simply repeats the last one received. A minor difference between my implementation and DeepMind's is that the three RGB images I stack are not the consecutive ones between two calls from the client; they are the images from the three last calls of the client. So instead of being spaced by 0.02 seconds, they are spaced by about 0.05 seconds. I don't know if it makes a big difference; what do you think? The reason I did it this way is that a call/response from the client seems to be very expensive time-wise (it takes about 0.05 s, I would say, though I am not sure about the time taken by a forward pass through the actor and critic networks), so to get the three consecutive images the best way would be to modify the source code of vtorcs so that the server sends 3 images instead of only one.
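
For reference, the discretized Ornstein-Uhlenbeck process those parameters define is roughly the following (dt and the reset policy here are placeholders):

    import numpy as np

    class OUNoise:
        """x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, 1)."""
        def __init__(self, mu, sigma, theta=0.15, dt=0.05):
            self.mu, self.sigma = np.array(mu, dtype=float), np.array(sigma, dtype=float)
            self.theta, self.dt = theta, dt
            self.x = np.copy(self.mu)

        def sample(self):
            dx = self.theta * (self.mu - self.x) * self.dt \
                 + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.mu.shape)
            self.x = self.x + dx
            return self.x

    # theta pulls the noise back toward mu (so it decays toward 0 when mu = 0) and
    # sigma sets its scale, e.g. noise = OUNoise(mu=[0, 0, 0], sigma=[0.2] * 3)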

My implementation of DDPG is largely based on the one from OpenAI, and I modified this repo so that the TorcsEnv class inherits from gym.Env to make it compatible with the OpenAI code.

Thank you very much for any advice about how to make it work! :)

theOGognf commented 5 years ago

I had the same problem (but I fixed it; scroll down for the solution): the actor was almost always predicting the max/min value after a small number of iterations and then staying that way. When this was happening, my actor and critic were 2 layers (64 neurons, 64 neurons) with ReLU hidden activations and tanh and linear output activations, respectively. The actor used a learning rate of 1e-4, while the critic used a learning rate of 1e-3; the critic also used a weight decay of 1e-2. The actor's output would reach the min/max values and then stay there because the gradients around that area are near zero. The critic's action gradient used for the actor (in the DDPG update rule) was also really small.

I tried a million things to try to stop the actor from falling into this rut. I tried 20 different hidden activation functions for both the actor and critic, I tried adding neurons for both the actor and critic, I tried different architectures, and I tried all the optimizers available to me (including Adam, Adamax, Adagrad, Adadelta, RMSProp, Rprop, and SGD).

SOLUTION: The thing that did it for me that resulted in CONSISTENT mid-range actor output and return improvement was just increasing the number of neurons for the critic (and not the actor). My final actor had 2 layers (64 neurons, 64 neurons) and my final critic had 2 layers (200 neurons, 200 neurons).

Perhaps someone could elaborate why this ended up working for me.

EDIT: I should also note, my actor had 30 inputs while my critic had about 100 (using extra state information similar to MADDPG).
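
A rough Keras sketch of that asymmetric sizing (the 30/100 input widths and 3 actions come from the comment above; everything else is a placeholder):

    from keras.layers import Input, Dense, merge
    from keras.models import Model

    actor_in = Input(shape=[30])                    # actor sees only its own observation
    a = Dense(64, activation='relu')(actor_in)
    a = Dense(64, activation='relu')(a)
    actor_out = Dense(3, activation='tanh')(a)      # bounded continuous actions
    actor = Model(input=actor_in, output=actor_out)

    critic_state = Input(shape=[100])               # critic gets extra environment information
    critic_action = Input(shape=[3])
    c = merge([critic_state, critic_action], mode='concat')
    c = Dense(200, activation='relu')(c)            # wider critic: 200-200
    c = Dense(200, activation='relu')(c)
    q_value = Dense(1, activation='linear')(c)
    critic = Model(input=[critic_state, critic_action], output=q_value)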

sufengniu commented 5 years ago

Hi @theOGognf, thank you for your suggestions! I haven't worked on this project for a while, but I am still curious how it works. To clarify your method: what you did is only add more neurons to the critic (200-200) while keeping the actor at 64-64? Also, you mentioned your actor has 30 inputs and your critic has about 100; can you provide more details? I thought the actor and critic used image-based input and shared the initial convolution layers. Thank you!

theOGognf commented 5 years ago

Correct - I just added more neurons to the critic. My critic makes use of extra environment information (such as other agents' actions) because I was working on a multi-agent problem. This is why my critic has more inputs than my actor. Also, I apologize for not clarifying, but I didn't have any convolutional layers because I just used MLPs.

cyrilibrahim commented 5 years ago

Hi @theOGognf, this paper explores the idea of giving the critic network more capacity and information, and it explains why that works better: Asymmetric Actor Critic for Image-Based Robot Learning.

BCWang93 commented 5 years ago

(Quoting @pavitrakumar78's actor/critic network code and questions from the comment above.)

Hi, do you have the changed code in your repositories? Can you share the modified code that uses images as input? Thanks!

BCWang93 commented 5 years ago

@pavitrakumar78 Do you have the changed code? I need help with this recently. Thanks!

BCWang93 commented 5 years ago

@sufengniu Hi, have you solved this problem? I also need help with this recently. Do you have working modified code? Thanks!

pavitrakumar78 commented 5 years ago

@BCWang93 No, sorry! Please see my comment from March 1, 2017, above. The code might be alright, but you need to have very specific parameters to create a good agent. As I mentioned in my previous comment, I had to drop this idea due to insufficient time and resources to test various models.

BCWang93 commented 5 years ago

@saiprabhakar Hi, I see the code in your repositories, and when I run it I get some problems, like this:

    Episode : 0 Replay Buffer 0
    Reset
    Client connected on 3101..............
    AL lib: (WW) alSetError: Error generated on context 0x559e59362be0, code 0xa005
    OpenAL backend info:
      Vendor: OpenAL Community
      Renderer: OpenAL Soft
      Version: 1.1 ALSOFT 1.18.2
      Available sources: 256
      Available buffers: 1024 or more
    Dynamic Sources: requested: 235, created: 235
    static sources: 21
    dyn sources : 235
    sw 64 - sh 64 - vw 64 - vh 64 - imgsize 12288
    Timeout for client answer
    reward : -0.0373694501845383 speed : 0.0004189766446749369 0.0005422233541806539 angle : -5.018398146463313e-05
    [... many repeated "Timeout for client answer" lines, interleaved with near-zero reward/speed/angle printouts ...]
    reward : -200 speed : 0.0009257266918818156 0.0003867300103108088 angle : 0.35878850480573365
    [... more repeated "Timeout for client answer" lines and near-zero rewards ...]
    AL lib: (EE) alc_cleanup: 1 device not closed
    reward : 0.04602624022931155 speed : 0.00028094532589117684 0.0005083799858887991 angle : 0.2940622546577502

Can you help me solve this problem? Thanks!

chouer19 commented 5 years ago

Hello, I want to know whether you succeeded in using image input directly. I am trying to do so, but I do not know how. If you have done this, can I learn from you, please?


sufengniu commented 5 years ago

Hello @chouer19, I haven't worked on this problem for a long time. I haven't really seen anyone in a public repo succeed with purely image-based input (I don't count anyone who claims success but hasn't released code or proof). All the success cases are based on low-dimensional sensor features. I tried using supervised learning to train the agent, but once I switched to DDPG or actor-critic, the algorithm would diverge. I think that if you pre-train the image representation with an image autoencoder (e.g. U-Net) and then reuse it for the agent, it might help. That is all I know.
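
A rough sketch of that kind of encoder pre-training (a small convolutional autoencoder rather than a full U-Net; all the layer sizes are placeholders), in the Keras 1.x style used earlier in this thread:

    from keras.layers import Input, Convolution2D, MaxPooling2D, UpSampling2D
    from keras.models import Model

    frame = Input(shape=(64, 64, 3))                                    # one RGB frame scaled to [0, 1]
    x = Convolution2D(16, 3, 3, activation='relu', border_mode='same')(frame)
    x = MaxPooling2D((2, 2))(x)                                         # 32x32
    x = Convolution2D(32, 3, 3, activation='relu', border_mode='same')(x)
    encoded = MaxPooling2D((2, 2))(x)                                   # 16x16x32 latent feature map

    x = Convolution2D(32, 3, 3, activation='relu', border_mode='same')(encoded)
    x = UpSampling2D((2, 2))(x)                                         # 32x32
    x = Convolution2D(16, 3, 3, activation='relu', border_mode='same')(x)
    x = UpSampling2D((2, 2))(x)                                         # 64x64
    decoded = Convolution2D(3, 3, 3, activation='sigmoid', border_mode='same')(x)

    autoencoder = Model(input=frame, output=decoded)
    autoencoder.compile(optimizer='adam', loss='mse')                   # train to reconstruct TORCS frames
    encoder = Model(input=frame, output=encoded)                        # then reuse as the agent's front-end

The encoder output (frozen or fine-tuned) would then replace the raw pixels as the state fed to the actor and critic.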