youki-dev / youki

A container runtime written in Rust
https://youki-dev.github.io/youki/
Apache License 2.0
6.31k stars 346 forks source link

Fully support OwnedFd #2417

Open anti-entropy123 opened 1 year ago

anti-entropy123 commented 1 year ago

ref: issue #2317, PR #2369

Introducation Since nix version 0.27.1, some APIs (e.g., nix::pty::openpty(), nix::sys::socket::socket()) have started using OwnedFd to manage FDs. However, it is not straightforward to pass OwnedFd between forks. So in PR #2369, we still use RawFd in the subsequent steps.

This issue try to do the change from RawFd to OwnedFd.

anti-entropy123 commented 1 year ago

Some ideas The main challenge with using OwnedFd is that ContainerArgs requires its fields to satisfy the Clone trait, but OwnedFd does not (and shouldn't, in my opinion) support it. Maybe we can impl Clone for structs like Sender, in which we can use OwnedFd::try_clone() to safely share underlying sockets or connections. If we do this, we also need to check whether old FD will be leaked due to CloneCb.

oirom commented 11 months ago

Hi, can i try this issue?? I'm very newbie for rust and container runtime, so it will be take a bit long time I think...

utam0k commented 10 months ago

Don't worry ;)

I'm very newbie for rust and container runtime, so it will be take a bit long time I think...

zahash commented 7 months ago

@anti-entropy123 what should happen if OwnedFd::try_clone() returns Err?

say, when manually implementing the Clone trait for ContainerType

impl Clone for ContainerType {
    fn clone(&self) -> Self {
        match self {
            Self::InitContainer => Self::InitContainer,
            Self::TenantContainer { exec_notify_fd } => Self::TenantContainer { exec_notify_fd: exec_notify_fd.try_clone().unwrap() },
        }
    }
}

is unwrapping fine?

zahash commented 7 months ago

i have a question. lets say i try_clone() a OwnedFd a bunch of times.

let a = f.try_clone().unwrap();
let b = f.try_clone().unwrap();

what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

zahash commented 7 months ago

does it make sense to do this maybe?

pub enum ContainerType {
    InitContainer,
    TenantContainer { exec_notify_fd: Rc<OwnedFd> },
}
anti-entropy123 commented 7 months ago

what should happen if OwnedFd::try_clone() returns Err?

say, when manually implementing the Clone trait for ContainerType

Directly unwrapping indeed doesn't seem quite appropriate...

Could we just implement a try_clone method for structures like ContainerType (should also include ContainerArgs), rather than derive(Clone)? This way, errors can be passed, reusing the error handling logic from before. 👀

This suggestion might not be fully matured, but I hope it helps you. @zahash

zahash commented 7 months ago

i have a question. lets say i try_clone() a OwnedFd a bunch of times.

let a = f.try_clone().unwrap();
let b = f.try_clone().unwrap();

what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

@anti-entropy123 what about this issue then?

anti-entropy123 commented 7 months ago

what about this issue then?

I think closing one of the file descriptors won't affect the usability of the other one.

zahash commented 7 months ago

@anti-entropy123

but the OwnedFd calls close on the wrapped RawFd when dropped.

#[stable(feature = "io_safety", since = "1.63.0")]
impl Drop for OwnedFd {
    #[inline]
    fn drop(&mut self) {
        unsafe {
            // Note that errors are ignored when closing a file descriptor. The
            // reason for this is that if an error occurs we don't actually know if
            // the file descriptor was closed or not, and if we retried (for
            // something like EINTR), we might close another valid file descriptor
            // opened after we closed ours.
            #[cfg(not(target_os = "hermit"))]
            let _ = libc::close(self.fd);
            #[cfg(target_os = "hermit")]
            let _ = hermit_abi::close(self.fd);
        }
    }
}
anti-entropy123 commented 7 months ago

@zahash Sorry, I don't have a suitable Linux machine with me right now. I can try verifying the issue tomorrow morning.

TenantContainer { exec_notify_fd: Rc },

I've discussed this idea before; you can refer to this link for more information: https://github.com/containers/youki/pull/2369#issuecomment-1746761562

zahash commented 7 months ago

@anti-entropy123 @utam0k does cargo test run all the available tests?

zahash commented 7 months ago

also, mmap no longer accepts f: Option<F> and instead just accepts f: F and with in mmap, it tries to unwrap_or(-1) to get the raw fd.

but now, i can't find a way to create a invalid Fd OwnedFd::from_raw_fd(-1) panics at runtime because rust doesn't allow -1 (asserts that fd != -1)

        mman::mmap(
            None,
            NonZeroUsize::new(default_stack_size).ok_or(CloneError::ZeroStackSize)?,
            mman::ProtFlags::PROT_READ | mman::ProtFlags::PROT_WRITE,
            mman::MapFlags::MAP_PRIVATE | mman::MapFlags::MAP_ANONYMOUS | mman::MapFlags::MAP_STACK,
            OwnedFd::from_raw_fd(-1), // used to be None
            0,
        )
        .map_err(CloneError::StackAllocation)?

so i decided to be a little naughty and use OwnedFd::from_raw_fd(-2) and there are no runtime panics due to assertions and all the tests are passing.

is this fine?

OwnedFd::from_raw_fd(RawFd::MAX) also works

anti-entropy123 commented 7 months ago

@zahash I think we can try using mmap_anonymous https://docs.rs/nix/latest/nix/sys/mman/fn.mmap_anonymous.html

anti-entropy123 commented 7 months ago

what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

I tested this code snippet, and it ran fine.

fn verify_try_clone(fd: OwnedFd) {
    let new_fd = fd.try_clone().unwrap();
    drop(fd);

    let mut new_file = unsafe { File::from_raw_fd(new_fd.as_raw_fd()) };
    new_file.write_all("hello world".as_bytes()).unwrap();
}
anti-entropy123 commented 7 months ago

but the OwnedFd calls close on the wrapped RawFd when dropped.

@zahash The try_clone method of OwnedFd ultimately calls a syscall similar to dup2, so in the end, there are indeed two separate file descriptors. https://man7.org/linux/man-pages/man2/dup.2.html

zahash commented 7 months ago

@anti-entropy123 @utam0k

i was also thinking about having a ref datatype

pub enum ContainerTypeRef<'fd> {
    InitContainer,
    TenantContainer { exec_notify_id: BorrowedFd<'fd> }
}

that way, we may not have to necessarily clone all the time.

but the CloneCb in container_intermediate_process requires cloning the args.

    let cb: CloneCb = {
        let args = args.clone();
        ...
    };
zahash commented 7 months ago

@anti-entropy123 @utam0k

i have made changes related to issues https://github.com/containers/youki/issues/2417 and https://github.com/containers/youki/issues/2723

all the tests pass. Please checkout the patch file. if you like where this is going and satisfied with most of the changes, i will create a pull request for further review.

diff --git a/crates/libcgroups/src/v1/memory.rs b/crates/libcgroups/src/v1/memory.rs
index 98fa6b1e..38b0dd07 100644
--- a/crates/libcgroups/src/v1/memory.rs
+++ b/crates/libcgroups/src/v1/memory.rs
@@ -332,7 +332,7 @@ impl Memory {
             Err(e) => {
                 // we need to look into the raw OS error for an EBUSY status
                 match e.inner().raw_os_error() {
-                    Some(code) => match Errno::from_i32(code) {
+                    Some(code) => match Errno::from_raw(code) {
                         Errno::EBUSY => {
                             let usage = Self::get_memory_usage(cgroup_root)?;
                             let max_usage = Self::get_memory_max_usage(cgroup_root)?;
diff --git a/crates/libcontainer/src/container/builder_impl.rs b/crates/libcontainer/src/container/builder_impl.rs
index 38f2f1cc..59d7a062 100644
--- a/crates/libcontainer/src/container/builder_impl.rs
+++ b/crates/libcontainer/src/container/builder_impl.rs
@@ -132,7 +132,7 @@ impl ContainerBuilderImpl {
             prctl::set_dumpable(false).map_err(|e| {
                 LibcontainerError::Other(format!(
                     "error in setting dumpable to false : {}",
-                    nix::errno::from_i32(e)
+                    nix::errno::Errno::from_raw(e)
                 ))
             })?;
         }
@@ -141,7 +141,7 @@ impl ContainerBuilderImpl {
         // therefore we will have to move all the variable by value. Since self
         // is a shared reference, we have to clone these variables here.
         let container_args = ContainerArgs {
-            container_type: self.container_type,
+            container_type: self.container_type.clone(),
             syscall: self.syscall,
             spec: Rc::clone(&self.spec),
             rootfs: self.rootfs.to_owned(),
diff --git a/crates/libcontainer/src/container/tenant_builder.rs b/crates/libcontainer/src/container/tenant_builder.rs
index 0d805a0d..2c2dbc88 100644
--- a/crates/libcontainer/src/container/tenant_builder.rs
+++ b/crates/libcontainer/src/container/tenant_builder.rs
@@ -1,6 +1,6 @@
 use caps::Capability;
 use nix::fcntl::OFlag;
-use nix::unistd::{self, close, pipe2, read, Pid};
+use nix::unistd::{self, pipe2, read, Pid};
 use oci_spec::runtime::{
     Capabilities as SpecCapabilities, Capability as SpecCapability, LinuxBuilder,
     LinuxCapabilities, LinuxCapabilitiesBuilder, LinuxNamespace, LinuxNamespaceBuilder,
@@ -8,6 +8,7 @@ use oci_spec::runtime::{
 };
 use procfs::process::Namespace;

+use std::os::fd::AsRawFd;
 use std::rc::Rc;
 use std::{
     collections::HashMap,
@@ -148,13 +149,13 @@ impl TenantContainerBuilder {
         let mut notify_socket = NotifySocket::new(notify_path);
         notify_socket.notify_container_start()?;

-        close(write_end).map_err(LibcontainerError::OtherSyscall)?;
+        drop(builder_impl); // this will close write_end fd

         let mut err_str_buf = Vec::new();

         loop {
             let mut buf = [0; 3];
-            match read(read_end, &mut buf).map_err(LibcontainerError::OtherSyscall)? {
+            match read(read_end.as_raw_fd(), &mut buf).map_err(LibcontainerError::OtherSyscall)? {
                 0 => {
                     if err_str_buf.is_empty() {
                         return Ok(pid);
diff --git a/crates/libcontainer/src/process/args.rs b/crates/libcontainer/src/process/args.rs
index ad37c07b..11c5d7f2 100644
--- a/crates/libcontainer/src/process/args.rs
+++ b/crates/libcontainer/src/process/args.rs
@@ -1,5 +1,6 @@
 use libcgroups::common::CgroupConfig;
 use oci_spec::runtime::Spec;
+use std::os::fd::OwnedFd;
 use std::os::unix::prelude::RawFd;
 use std::path::PathBuf;
 use std::rc::Rc;
@@ -9,10 +10,21 @@ use crate::notify_socket::NotifyListener;
 use crate::syscall::syscall::SyscallType;
 use crate::user_ns::UserNamespaceConfig;
 use crate::workload::Executor;
-#[derive(Debug, Copy, Clone)]
+#[derive(Debug)]
 pub enum ContainerType {
     InitContainer,
-    TenantContainer { exec_notify_fd: RawFd },
+    TenantContainer { exec_notify_fd: OwnedFd },
+}
+
+impl Clone for ContainerType {
+    fn clone(&self) -> Self {
+        match self {
+            Self::InitContainer => Self::InitContainer,
+            Self::TenantContainer { exec_notify_fd } => Self::TenantContainer {
+                exec_notify_fd: exec_notify_fd.try_clone().unwrap(/* is unwrap ok? */),
+            },
+        }
+    }
 }

 #[derive(Clone)]
diff --git a/crates/libcontainer/src/process/container_intermediate_process.rs b/crates/libcontainer/src/process/container_intermediate_process.rs
index 0085416b..7e5ac9c0 100644
--- a/crates/libcontainer/src/process/container_intermediate_process.rs
+++ b/crates/libcontainer/src/process/container_intermediate_process.rs
@@ -1,3 +1,5 @@
+use std::os::fd::AsRawFd;
+
 use crate::error::MissingSpecError;
 use crate::{namespaces::Namespaces, process::channel, process::fork};
 use libcgroups::common::CgroupManager;
@@ -128,12 +130,12 @@ pub fn container_intermediate_process(
                     if let Err(err) = main_sender.exec_failed(e.to_string()) {
                         tracing::error!(?err, "failed sending error to main sender");
                     }
-                    if let ContainerType::TenantContainer { exec_notify_fd } = args.container_type {
+                    if let ContainerType::TenantContainer { exec_notify_fd } = &args.container_type {
                         let buf = format!("{e}");
-                        if let Err(err) = write(exec_notify_fd, buf.as_bytes()) {
+                        if let Err(err) = write(&exec_notify_fd, buf.as_bytes()) {
                             tracing::error!(?err, "failed to write to exec notify fd");
                         }
-                        if let Err(err) = close(exec_notify_fd) {
+                        if let Err(err) = close(exec_notify_fd.as_raw_fd()) {
                             tracing::error!(?err, "failed to close exec notify fd");
                         }
                     }
@@ -157,8 +159,8 @@ pub fn container_intermediate_process(
     })?;

     // Close the exec_notify_fd in this process
-    if let ContainerType::TenantContainer { exec_notify_fd } = args.container_type {
-        close(exec_notify_fd).map_err(|err| {
+    if let ContainerType::TenantContainer { exec_notify_fd } = &args.container_type {
+        close(exec_notify_fd.as_raw_fd()).map_err(|err| {
             tracing::error!("failed to close exec notify fd: {}", err);
             IntermediateProcessError::ExecNotify(err)
         })?;
@@ -206,7 +208,7 @@ fn setup_userns(
     prctl::set_dumpable(true).map_err(|e| {
         IntermediateProcessError::Other(format!(
             "error in setting dumpable to true : {}",
-            nix::errno::from_i32(e)
+            nix::errno::Errno::from_raw(e)
         ))
     })?;
     sender.identifier_mapping_request().map_err(|err| {
@@ -220,7 +222,7 @@ fn setup_userns(
     prctl::set_dumpable(false).map_err(|e| {
         IntermediateProcessError::Other(format!(
             "error in setting dumplable to false : {}",
-            nix::errno::from_i32(e)
+            nix::errno::Errno::from_raw(e)
         ))
     })?;
     Ok(())
diff --git a/crates/libcontainer/src/process/fork.rs b/crates/libcontainer/src/process/fork.rs
index c0e05539..60a33f46 100644
--- a/crates/libcontainer/src/process/fork.rs
+++ b/crates/libcontainer/src/process/fork.rs
@@ -1,4 +1,4 @@
-use std::{ffi::c_int, fs::File, num::NonZeroUsize};
+use std::{ffi::c_int, num::NonZeroUsize};

 use libc::SIGCHLD;
 use nix::{
@@ -164,15 +164,17 @@ fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, Clone
     // do not use MAP_GROWSDOWN since it is not well supported.
     // Ref: https://man7.org/linux/man-pages/man2/mmap.2.html
     let child_stack = unsafe {
-        // Since nix = "0.27.1", `mmap()` requires a generic type `F: AsFd`.
-        // `::<File>` doesn't have any meaning because we won't use it.
-        mman::mmap::<File>(
+        // Since nix = "0.27.1", `mmap()` requires a generic type `F: AsFd` and takes f: Option<F>.
+        // if f: None, then fd = -1 is used; (-1 is just an invalid fd)
+        // let fd = f.map(|f| f.as_fd().as_raw_fd()).unwrap_or(-1);
+        // Since nix = "0.28.0" mmap function accepts f: F instead of Option<F>
+        // but it is not possible to manually create OwnedFd::from_raw_fd(-1) because rust will panic due to assert
+        // So, mmap_anonymous is used which achieves the exact same behaviour
+        mman::mmap_anonymous(
             None,
             NonZeroUsize::new(default_stack_size).ok_or(CloneError::ZeroStackSize)?,
             mman::ProtFlags::PROT_READ | mman::ProtFlags::PROT_WRITE,
             mman::MapFlags::MAP_PRIVATE | mman::MapFlags::MAP_ANONYMOUS | mman::MapFlags::MAP_STACK,
-            None,
-            0,
         )
         .map_err(CloneError::StackAllocation)?
     };
@@ -187,7 +189,7 @@ fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, Clone

     // Since the child stack for clone grows downward, we need to pass in
     // the top of the stack address.
-    let child_stack_top = unsafe { child_stack.add(default_stack_size) };
+    let child_stack_top = unsafe { child_stack.as_ptr().add(default_stack_size) };

     // Combine the clone flags with exit signals.
     let combined_flags = (flags | exit_signal.unwrap_or(0)) as c_int;
diff --git a/crates/libcontainer/src/seccomp/mod.rs b/crates/libcontainer/src/seccomp/mod.rs
index 2bfce4cb..a1acb131 100644
--- a/crates/libcontainer/src/seccomp/mod.rs
+++ b/crates/libcontainer/src/seccomp/mod.rs
@@ -332,7 +332,7 @@ mod tests {
             }

             if let Some(errno) = ret.err() {
-                if errno != nix::errno::from_i32(expect_error) {
+                if errno != nix::errno::Errno::from_raw(expect_error) {
                     Err(TestCallbackError::Custom(format!(
                         "getcwd failed but we didn't get the expected error from seccomp profile: {}",
                         errno
diff --git a/crates/libcontainer/src/syscall/linux.rs b/crates/libcontainer/src/syscall/linux.rs
index d01c4a54..35e5430c 100644
--- a/crates/libcontainer/src/syscall/linux.rs
+++ b/crates/libcontainer/src/syscall/linux.rs
@@ -314,7 +314,7 @@ impl Syscall for LinuxSyscall {
     fn set_id(&self, uid: Uid, gid: Gid) -> Result<()> {
         prctl::set_keep_capabilities(true).map_err(|errno| {
             tracing::error!(?errno, "failed to set keep capabilities to true");
-            nix::errno::from_i32(errno)
+            nix::errno::Errno::from_raw(errno)
         })?;
         // args : real *id, effective *id, saved set *id respectively

@@ -350,7 +350,7 @@ impl Syscall for LinuxSyscall {
         }
         prctl::set_keep_capabilities(false).map_err(|errno| {
             tracing::error!(?errno, "failed to set keep capabilities to false");
-            nix::errno::from_i32(errno)
+            nix::errno::Errno::from_raw(errno)
         })?;
         Ok(())
     }
diff --git a/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs b/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
index 10664b2e..60337c4d 100644
--- a/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
+++ b/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
@@ -1,7 +1,7 @@
 use anyhow::{bail, Context, Result};
 use libcontainer::container::ContainerProcessState;
 use nix::{
-    sys::socket::{self, UnixAddr},
+    sys::socket::{self, Backlog, UnixAddr},
     unistd,
 };
 use std::{
@@ -35,7 +35,11 @@ pub fn recv_seccomp_listener(seccomp_listener: &Path) -> SeccompAgentResult {
     socket::bind(socket.as_raw_fd(), &addr).context("failed to bind to seccomp listener socket")?;
     // Force the backlog to be 1 so in the case of an error, only one connection
     // from clients will be waiting.
-    socket::listen(&socket.as_fd(), 1).context("failed to listen on seccomp listener")?;
+    socket::listen(
+        &socket.as_fd(),
+        Backlog::new(1).unwrap( /* safe to unwrap because Backlog(1) is always valid */ ),
+    )
+    .context("failed to listen on seccomp listener")?;
     let conn = match socket::accept(socket.as_raw_fd()) {
         Ok(conn) => conn,
         Err(e) => {
diff --git a/tests/contest/runtimetest/src/tests.rs b/tests/contest/runtimetest/src/tests.rs
index dee3afc7..d1d2f2ce 100644
--- a/tests/contest/runtimetest/src/tests.rs
+++ b/tests/contest/runtimetest/src/tests.rs
@@ -34,7 +34,7 @@ pub fn validate_readonly_paths(spec: &Spec) {
     // change manual matching of i32 to e.kind() and match statement
     for path in ro_paths {
         if let std::io::Result::Err(e) = test_read_access(path) {
-            let errno = Errno::from_i32(e.raw_os_error().unwrap());
+            let errno = Errno::from_raw(e.raw_os_error().unwrap());
             // In the integration tests we test for both existing and non-existing readonly paths
             // to be specified in the spec, so we allow ENOENT here
             if errno == Errno::ENOENT {
@@ -50,7 +50,7 @@ pub fn validate_readonly_paths(spec: &Spec) {
         }

         if let std::io::Result::Err(e) = test_write_access(path) {
-            let errno = Errno::from_i32(e.raw_os_error().unwrap());
+            let errno = Errno::from_raw(e.raw_os_error().unwrap());
             // In the integration tests we test for both existing and non-existing readonly paths
             // being specified in the spec, so we allow ENOENT, and we expect EROFS as the paths
             // should be read-only
YJDoc2 commented 7 months ago

Hey @zahash , I'd suggest you wait till #2728 is merged, as that changes some of the things mentioned here, eg the mmap_anonymous. Also instead of using patch files, please either submit a PR, so link to a branch in your repo, as that would be easier for everyone to check and see.

zahash commented 7 months ago

@YJDoc2 oh, didn't realize there was already a pr to update nix. Fine then i will wait till it gets merged and then continue my work on the "Fully support OwnedFd" part

zahash commented 7 months ago

also, @utam0k can i also be assigned to this issue?

YJDoc2 commented 7 months ago

@oirom , do you still plan to work on this? Otherwise we'd probably re-assign this to @zahash

anti-entropy123 commented 7 months ago

@zahash, have you considered the possibility of a FD leak here?

container_main_process.rs

pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bool)> {
    ...
    let cb: CloneCb = {
        let container_args = container_args.clone();
        ...
        Box::new(move || {
        ...
        })
    };
    ...
    let intermediate_pid = fork::container_clone(cb)

fork.rs

fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
    ...
    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
        unsafe { Box::from_raw(data as *mut CloneCb)() }
    }
    let ret = unsafe {
        libc::clone(
            main,
            child_stack_top,
            combined_flags,
            data as *mut libc::c_void,
        )
    };
    ...

It seems the child process doesn't have a chance to drop the OwnedFd outside of CloneCb. :thinking: (In the past, this wasn't a problem because they were all the same RawFd.)

oirom commented 7 months ago

@oirom , do you still plan to work on this? Otherwise we'd probably re-assign this to @zahash

Now this issue is too difficult to me, so please re-assign to @zahash

YJDoc2 commented 7 months ago

Now this issue is too difficult to me, so please re-assign to @zahash

No worries, feel free to take a look at some other issues such as #361 if you still want to contribute!

@zahash , assigning to you :+1:

zahash commented 7 months ago

@zahash, have you considered the possibility of a FD leak here?

container_main_process.rs

pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bool)> {
    ...
    let cb: CloneCb = {
        let container_args = container_args.clone();
        ...
        Box::new(move || {
        ...
        })
    };
    ...
    let intermediate_pid = fork::container_clone(cb)

fork.rs

fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
    ...
    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
        unsafe { Box::from_raw(data as *mut CloneCb)() }
    }
    let ret = unsafe {
        libc::clone(
            main,
            child_stack_top,
            combined_flags,
            data as *mut libc::c_void,
        )
    };
    ...

It seems the child process doesn't have a chance to drop the OwnedFd outside of CloneCb. 🤔 (In the past, this wasn't a problem because they were all the same RawFd.)

@anti-entropy123 what about this? in the below line?

unsafe { drop(Box::from_raw(data)) };
anti-entropy123 commented 7 months ago

I think it seems only the parent process can reach this point. Additionally, this is still dropping CloneCb rather than releasing data outside of the closure.

zahash commented 7 months ago

@anti-entropy123

i don't think i fully understand what you are saying. can you please explain what you mean by "data outside of the closure"

can you also please explain a bit what the clone and clone3 functions are doing? its a bit hard to understand 😅

zahash commented 7 months ago

@YJDoc2 is it feasible to use OwnedFd in Sender and Receiver? or is it too big of a change?

pub struct Receiver<T> {
    receiver: OwnedFd,
    phantom: PhantomData<T>,
}

pub struct Sender<T> {
    sender: OwnedFd,
    phantom: PhantomData<T>,
}

but OwnedFd cannot be cloned, especially when it comes to creating a CloneCb in container_main_process and container_intermediate_process. So, what should be the strategy here?

anti-entropy123 commented 7 months ago

@zahash Here's my understanding of execution steps, but it might not be accurate...

  1. When creating CloneCb, container_args along with its internal OwnedFd are cloned, and then moved into the closure.
  2. Subsequently, clone() is called to create a child process. The child process inherits all files opened by the parent process. (You can confirm this by logging)
    fn get_current_fds_nums() -> usize {
    let proc_fd_dir = format!("/proc/{}/fd", process::id());
    fs::read_dir(proc_fd_dir).unwrap().collect::<Vec<_>>().len()
    }

    image image image

  3. However, the entrypoint of the child process is set to a special main function, which simply calls the provided closure. The child process loses ownership of any variables outside of the closure. (This includes the original container_args as well.) So, it doesn't have the opportunity to execute the drop() function for those variables. image

Even though clone3() or clone() might be executed based on the system environment, the result should be the same.

zahash commented 7 months ago

@anti-entropy123

i tried to avoid cloning the ContainerArgs in the CloneCb and it complains that the callback cb might outlive the &args (or atleast thats what rust thinks).

let cb: CloneCb = {
        Box::new(move || {
            // ...
        })
    };

but aren't we are waiting for the child process to finish so that it won't outlive the callback? is there a "idiomatic" way to do this so rust can understand?

    let pid = fork::container_clone_sibling(cb).map_err(|err| {
        tracing::error!("failed to fork init process: {}", err);
        IntermediateProcessError::InitProcess(err)
    })?;

    // ...

    main_sender.intermediate_ready(pid).map_err(|err| {
        tracing::error!("failed to wait on intermediate process: {}", err);
        err
    })?;

i can try adding a lifetime parameter to CloneCb

pub type CloneCb<'a> = Box<dyn FnMut() -> i32 + 'a>;
zahash commented 7 months ago

Also, cloning a OwnedFd only fails on wasm.

impl OwnedFd {
    /// Creates a new `OwnedFd` instance that shares the same underlying file
    /// description as the existing `OwnedFd` instance.
    #[stable(feature = "io_safety", since = "1.63.0")]
    pub fn try_clone(&self) -> crate::io::Result<Self> {
        self.as_fd().try_clone_to_owned()
    }
}

impl BorrowedFd<'_> {
    /// Creates a new `OwnedFd` instance that shares the same underlying file
    /// description as the existing `BorrowedFd` instance.
    #[cfg(not(any(target_arch = "wasm32", target_os = "hermit")))]
    #[stable(feature = "io_safety", since = "1.63.0")]
    pub fn try_clone_to_owned(&self) -> crate::io::Result<OwnedFd> {
        // We want to atomically duplicate this file descriptor and set the
        // CLOEXEC flag, and currently that's done via F_DUPFD_CLOEXEC. This
        // is a POSIX flag that was added to Linux in 2.6.24.
        #[cfg(not(target_os = "espidf"))]
        let cmd = libc::F_DUPFD_CLOEXEC;

        // For ESP-IDF, F_DUPFD is used instead, because the CLOEXEC semantics
        // will never be supported, as this is a bare metal framework with
        // no capabilities for multi-process execution. While F_DUPFD is also
        // not supported yet, it might be (currently it returns ENOSYS).
        #[cfg(target_os = "espidf")]
        let cmd = libc::F_DUPFD;

        // Avoid using file descriptors below 3 as they are used for stdio
        let fd = cvt(unsafe { libc::fcntl(self.as_raw_fd(), cmd, 3) })?;
        Ok(unsafe { OwnedFd::from_raw_fd(fd) })
    }

    /// Creates a new `OwnedFd` instance that shares the same underlying file
    /// description as the existing `BorrowedFd` instance.
    #[cfg(any(target_arch = "wasm32", target_os = "hermit"))]
    #[stable(feature = "io_safety", since = "1.63.0")]
    pub fn try_clone_to_owned(&self) -> crate::io::Result<OwnedFd> {
        Err(crate::io::const_io_error!(
            crate::io::ErrorKind::Unsupported,
            "operation not supported on WASI yet",
        ))
    }
}
anti-entropy123 commented 7 months ago

is there a "idiomatic" way to do this so rust can understand?

I think the main issue is that clone causes a "stack switch", which makes the child process lose all the local variables originally on the stack. So, at the moment, I don't have a good idea to solve this problem... (Apart from using Rc and manually adjusting the reference count.) @zahash

zahash commented 7 months ago

I think the main issue is that clone causes a "stack switch", which makes the child process lose all the local variables originally on the stack.

@anti-entropy123

i thought clone creates a process identical to the parent with all the local variables. but they both will have separate memory spaces so modification done on the child won't affect parent. besides, i don't think we are doing any modifications. (i might be wrong).

also, the CloneCb closure is boxed and we send the actual function pointer of the closure to libc. so shouldn't the closure retain all the references to the enclosed variables in the function body.

zahash commented 7 months ago

@anti-entropy123 i did this and all the tests pass. I basically removed #[derive(Clone)] from Senders and Receivers and added a lifetime to CloneCb

@YJDoc2 i can't create a PR right now because this hasn't merged yet https://github.com/containers/youki/pull/2728 and a lot of my changes depend on it. sorry for using a patch again.

diff --git a/crates/libcontainer/src/channel.rs b/crates/libcontainer/src/channel.rs
index 23d255f9..427912d6 100644
--- a/crates/libcontainer/src/channel.rs
+++ b/crates/libcontainer/src/channel.rs
@@ -18,13 +18,11 @@ pub enum ChannelError {
     #[error("channel connection broken")]
     BrokenChannel,
 }
-#[derive(Clone)]
 pub struct Receiver<T> {
     receiver: RawFd,
     phantom: PhantomData<T>,
 }

-#[derive(Clone)]
 pub struct Sender<T> {
     sender: RawFd,
     phantom: PhantomData<T>,
diff --git a/crates/libcontainer/src/process/channel.rs b/crates/libcontainer/src/process/channel.rs
index 9683e5ed..b1bd2e88 100644
--- a/crates/libcontainer/src/process/channel.rs
+++ b/crates/libcontainer/src/process/channel.rs
@@ -40,7 +40,6 @@ pub fn main_channel() -> Result<(MainSender, MainReceiver), ChannelError> {
     Ok((MainSender { sender }, MainReceiver { receiver }))
 }

-#[derive(Clone)]
 pub struct MainSender {
     sender: Sender<Message>,
 }
@@ -88,7 +87,6 @@ impl MainSender {
     }
 }

-#[derive(Clone)]
 pub struct MainReceiver {
     receiver: Receiver<Message>,
 }
@@ -199,7 +197,6 @@ pub fn intermediate_channel() -> Result<(IntermediateSender, IntermediateReceive
     ))
 }

-#[derive(Clone)]
 pub struct IntermediateSender {
     sender: Sender<Message>,
 }
@@ -219,7 +216,6 @@ impl IntermediateSender {
     }
 }

-#[derive(Clone)]
 pub struct IntermediateReceiver {
     receiver: Receiver<Message>,
 }
@@ -256,7 +252,6 @@ pub fn init_channel() -> Result<(InitSender, InitReceiver), ChannelError> {
     Ok((InitSender { sender }, InitReceiver { receiver }))
 }

-#[derive(Clone)]
 pub struct InitSender {
     sender: Sender<Message>,
 }
@@ -275,7 +270,6 @@ impl InitSender {
     }
 }

-#[derive(Clone)]
 pub struct InitReceiver {
     receiver: Receiver<Message>,
 }
diff --git a/crates/libcontainer/src/process/container_intermediate_process.rs b/crates/libcontainer/src/process/container_intermediate_process.rs
index 0085416b..8d9a015c 100644
--- a/crates/libcontainer/src/process/container_intermediate_process.rs
+++ b/crates/libcontainer/src/process/container_intermediate_process.rs
@@ -99,12 +99,7 @@ pub fn container_intermediate_process(
     }

     let cb: CloneCb = {
-        let args = args.clone();
-        let init_sender = init_sender.clone();
-        let inter_sender = inter_sender.clone();
-        let mut main_sender = main_sender.clone();
-        let mut init_receiver = init_receiver.clone();
-        Box::new(move || {
+        Box::new(|| {
             if let Err(ret) = prctl::set_name("youki:[2:INIT]") {
                 tracing::error!(?ret, "failed to set name for child process");
                 return ret;
@@ -121,7 +116,7 @@ pub fn container_intermediate_process(
                 tracing::error!(?err, "failed to close sender in the intermediate process");
                 return -1;
             }
-            match container_init_process(&args, &mut main_sender, &mut init_receiver) {
+            match container_init_process(&args, main_sender, init_receiver) {
                 Ok(_) => 0,
                 Err(e) => {
                     tracing::error!("failed to initialize container process: {e}");
diff --git a/crates/libcontainer/src/process/container_main_process.rs b/crates/libcontainer/src/process/container_main_process.rs
index f7066053..41fa96ab 100644
--- a/crates/libcontainer/src/process/container_main_process.rs
+++ b/crates/libcontainer/src/process/container_main_process.rs
@@ -42,16 +42,12 @@ pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bo
     // cloned process, we have to be deligent about closing any unused channel.
     // At minimum, we have to close down any unused senders. The corresponding
     // receivers will be cleaned up once the senders are closed down.
-    let (main_sender, mut main_receiver) = channel::main_channel()?;
-    let inter_chan = channel::intermediate_channel()?;
-    let init_chan = channel::init_channel()?;
+    let (mut main_sender, mut main_receiver) = channel::main_channel()?;
+    let mut inter_chan = channel::intermediate_channel()?;
+    let mut init_chan = channel::init_channel()?;

     let cb: CloneCb = {
-        let container_args = container_args.clone();
-        let mut main_sender = main_sender.clone();
-        let mut inter_chan = inter_chan.clone();
-        let mut init_chan = init_chan.clone();
-        Box::new(move || {
+        Box::new(|| {
             if let Err(ret) = prctl::set_name("youki:[1:INTER]") {
                 tracing::error!(?ret, "failed to set name for child process");
                 return ret;
diff --git a/crates/libcontainer/src/process/fork.rs b/crates/libcontainer/src/process/fork.rs
index c0e05539..2a69d8d2 100644
--- a/crates/libcontainer/src/process/fork.rs
+++ b/crates/libcontainer/src/process/fork.rs
@@ -33,7 +33,7 @@ pub enum CloneError {
 /// pass in a child stack, which is empty. By storing the closure in heap, we
 /// can then in the new process to re-box the heap memory back to a closure
 /// correctly.
-pub type CloneCb = Box<dyn FnMut() -> i32>;
+pub type CloneCb<'a> = Box<dyn FnMut() -> i32 + 'a>;

 // Clone a sibling process that shares the same parent as the calling
 // process. This is used to launch the container init process so the parent
zahash commented 7 months ago

the lifetime in CloneCb<_> is only there to satisfy rust. i think the actual reason this is working is because we actually wait for the child process to finish.

    let pid = fork::container_clone_sibling(cb).map_err(|err| {
        tracing::error!("failed to fork init process: {}", err);
        IntermediateProcessError::InitProcess(err)
    })?;

    // ...

    main_sender.intermediate_ready(pid).map_err(|err| {
        tracing::error!("failed to wait on intermediate process: {}", err);
        err
    })?;

without this, the child might outlive the parent and become zombie. and rust lifetimes won't save us.

maybe there is a better way to do this so that rust understands the "waiting" part is important for the lifetime.

anti-entropy123 commented 7 months ago

OK, I agree with what you're saying, and since your code doesn't clone the container_args variable, there should be no copying of file descriptors. Therefore, fd leaks shouldn't occur. :+1:

However, I'd still like to ask if you're aware of whether you're actually calling clone3() or clone()?

// An internal wrapper to manage the clone3 vs clone fallback logic.
fn clone_internal(
    mut cb: CloneCb,
    flags: u64,
    exit_signal: Option<u64>,
) -> Result<Pid, CloneError> {
    match clone3(&mut cb, flags, exit_signal) {
        ...
        // For now, we decide to only fallback on ENOSYS
        Err(CloneError::Clone(nix::Error::ENOSYS)) => {
            ...
            let pid = clone(cb, flags, exit_signal)?;

Since in clone(), a new stack is allocated for the child process,

fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
    ...
    let ret = unsafe {
            libc::clone(
                main,
                child_stack_top,
                combined_flags,
                data as *mut libc::c_void,
            )
        };
    ...

wouldn't the closure's references to local variables (e.g., &container_args) become invalid? @zahash

zahash commented 7 months ago

@anti-entropy123

I have the exact same question 😅😅

do you know how clone3 vs clone differences are? and since the closure has references to variables in the paren'ts memory, do they also get copied to the child memory?

i can see that clone3 also takes args but clone doesn't. (but i don't think we are passing anything related in it)

but why do the tests pass? do they even check this part of the code?

anti-entropy123 commented 7 months ago

I think your test device should support clone3(), so it likely called clone3() in reality. And clone3() seems to completely copy the parent process's memory, which is why your test passed.

zahash commented 7 months ago

@anti-entropy123 is there a way to force it to use clone instead?

never mind. i'll just comment out the clone3 part

anti-entropy123 commented 7 months ago

But I'm not entirely sure why clone() necessitates switching to a new stack; it could be a requirement of the syscall itself (? :thinking: ).

zahash commented 7 months ago

@anti-entropy123 all the tests still pass

fn clone_internal(
    mut cb: CloneCb,
    flags: u64,
    exit_signal: Option<u64>,
) -> Result<Pid, CloneError> {
    clone(cb, flags, exit_signal)
}
anti-entropy123 commented 7 months ago

all the tests still pass

Amazing... I'm curious about the reason. @zahash

zahash commented 7 months ago

@anti-entropy123

Amazing... I'm curious about the reason

I'm equally curious. 😂

maybe this is undefined behavior.

anti-entropy123 commented 7 months ago

maybe this is undefined behavior

My expectation was that a segmentation fault would occur.

zahash commented 7 months ago

@anti-entropy123 if found this on the kernel man page

The clone() wrapper function
       When the child process is created with the clone() wrapper
       function, it commences execution by calling the function pointed
       to by the argument fn.  (This differs from [fork(2)](https://man7.org/linux/man-pages/man2/fork.2.html), where
       execution continues in the child from the point of the [fork(2)](https://man7.org/linux/man-pages/man2/fork.2.html)
       call.)  The arg argument is passed as the argument of the
       function fn.

       When the fn(arg) function returns, the child process terminates.
       The integer returned by fn is the exit status for the child
       process.  The child process may also terminate explicitly by
       calling [exit(2)](https://man7.org/linux/man-pages/man2/exit.2.html) or after receiving a fatal signal.

       The stack argument specifies the location of the stack used by
       the child process.  Since the child and calling process may share
       memory, it is not possible for the child process to execute in
       the same stack as the calling process.  The calling process must
       therefore set up memory space for the child stack and pass a
       pointer to this space to clone().  Stacks grow downward on all
       processors that run Linux (except the HP PA processors), so stack
       usually points to the topmost address of the memory space set up
       for the child stack.  Note that clone() does not provide a means
       whereby the caller can inform the kernel of the size of the stack
       area.

https://man7.org/linux/man-pages/man2/clone.2.html

zahash commented 7 months ago

what about others? what do they think? @utam0k @YJDoc2 @yihuaf

anti-entropy123 commented 7 months ago

I think I understand now. Even though clone() assigns a new stack to the child process, the original memory region (the old stack) may still be valid. So, those references remain effective. Do you think so? @zahash

YJDoc2 commented 7 months ago

Hey, will need some time to read all the discussion above and consider it. Probably would get back on this around the weekend....

zahash commented 7 months ago

@anti-entropy123

maybe. but how can one process directly access the memory of another process? its not allowed right?

or maybe its allowed in "parent - child" processes?

or maybe its because, the child gets an exact copy of the parent's memory when clone/fork