Open anti-entropy123 opened 1 year ago
Some ideas
The main challenge with using OwnedFd
is that ContainerArgs
requires its fields to satisfy the Clone
trait, but OwnedFd
does not (and shouldn't, in my opinion) support it.
Maybe we can impl Clone
for structs like Sender
, in which we can use OwnedFd::try_clone()
to safely share underlying sockets or connections. If we do this, we also need to check whether old FD will be leaked due to CloneCb
.
Hi, can i try this issue?? I'm very newbie for rust and container runtime, so it will be take a bit long time I think...
Don't worry ;)
I'm very newbie for rust and container runtime, so it will be take a bit long time I think...
@anti-entropy123 what should happen if OwnedFd::try_clone()
returns Err?
say, when manually implementing the Clone trait for ContainerType
impl Clone for ContainerType {
fn clone(&self) -> Self {
match self {
Self::InitContainer => Self::InitContainer,
Self::TenantContainer { exec_notify_fd } => Self::TenantContainer { exec_notify_fd: exec_notify_fd.try_clone().unwrap() },
}
}
}
is unwrapping fine?
i have a question. lets say i try_clone()
a OwnedFd
a bunch of times.
let a = f.try_clone().unwrap();
let b = f.try_clone().unwrap();
what if one of those variables is dropped? will the file close?
if thats the case, then thats a big problem because having a OwnedFd
instance assumes that the file is open
does it make sense to do this maybe?
pub enum ContainerType {
InitContainer,
TenantContainer { exec_notify_fd: Rc<OwnedFd> },
}
what should happen if
OwnedFd::try_clone()
returns Err?say, when manually implementing the Clone trait for
ContainerType
Directly unwrapping indeed doesn't seem quite appropriate...
Could we just implement a try_clone
method for structures like ContainerType
(should also include ContainerArgs
), rather than derive(Clone)
? This way, errors can be passed, reusing the error handling logic from before. 👀
This suggestion might not be fully matured, but I hope it helps you. @zahash
i have a question. lets say i
try_clone()
aOwnedFd
a bunch of times.let a = f.try_clone().unwrap(); let b = f.try_clone().unwrap();
what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a
OwnedFd
instance assumes that the file is open
@anti-entropy123 what about this issue then?
what about this issue then?
I think closing one of the file descriptors won't affect the usability of the other one.
@anti-entropy123
but the OwnedFd
calls close
on the wrapped RawFd
when dropped.
#[stable(feature = "io_safety", since = "1.63.0")]
impl Drop for OwnedFd {
#[inline]
fn drop(&mut self) {
unsafe {
// Note that errors are ignored when closing a file descriptor. The
// reason for this is that if an error occurs we don't actually know if
// the file descriptor was closed or not, and if we retried (for
// something like EINTR), we might close another valid file descriptor
// opened after we closed ours.
#[cfg(not(target_os = "hermit"))]
let _ = libc::close(self.fd);
#[cfg(target_os = "hermit")]
let _ = hermit_abi::close(self.fd);
}
}
}
@zahash Sorry, I don't have a suitable Linux machine with me right now. I can try verifying the issue tomorrow morning.
TenantContainer { exec_notify_fd: Rc
},
I've discussed this idea before; you can refer to this link for more information: https://github.com/containers/youki/pull/2369#issuecomment-1746761562
@anti-entropy123 @utam0k does cargo test
run all the available tests?
also, mmap no longer accepts f: Option<F>
and instead just accepts f: F
and with in mmap, it tries to unwrap_or(-1) to get the raw fd.
but now, i can't find a way to create a invalid Fd OwnedFd::from_raw_fd(-1)
panics at runtime because rust doesn't allow -1 (asserts that fd != -1)
mman::mmap(
None,
NonZeroUsize::new(default_stack_size).ok_or(CloneError::ZeroStackSize)?,
mman::ProtFlags::PROT_READ | mman::ProtFlags::PROT_WRITE,
mman::MapFlags::MAP_PRIVATE | mman::MapFlags::MAP_ANONYMOUS | mman::MapFlags::MAP_STACK,
OwnedFd::from_raw_fd(-1), // used to be None
0,
)
.map_err(CloneError::StackAllocation)?
so i decided to be a little naughty and use OwnedFd::from_raw_fd(-2)
and there are no runtime panics due to assertions and all the tests are passing.
is this fine?
OwnedFd::from_raw_fd(RawFd::MAX)
also works
@zahash I think we can try using mmap_anonymous
https://docs.rs/nix/latest/nix/sys/mman/fn.mmap_anonymous.html
what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open
I tested this code snippet, and it ran fine.
fn verify_try_clone(fd: OwnedFd) {
let new_fd = fd.try_clone().unwrap();
drop(fd);
let mut new_file = unsafe { File::from_raw_fd(new_fd.as_raw_fd()) };
new_file.write_all("hello world".as_bytes()).unwrap();
}
but the OwnedFd calls close on the wrapped RawFd when dropped.
@zahash The try_clone method of OwnedFd ultimately calls a syscall similar to dup2, so in the end, there are indeed two separate file descriptors. https://man7.org/linux/man-pages/man2/dup.2.html
@anti-entropy123 @utam0k
i was also thinking about having a ref datatype
pub enum ContainerTypeRef<'fd> {
InitContainer,
TenantContainer { exec_notify_id: BorrowedFd<'fd> }
}
that way, we may not have to necessarily clone all the time.
but the CloneCb
in container_intermediate_process
requires cloning the args.
let cb: CloneCb = {
let args = args.clone();
...
};
@anti-entropy123 @utam0k
i have made changes related to issues https://github.com/containers/youki/issues/2417 and https://github.com/containers/youki/issues/2723
all the tests pass. Please checkout the patch file. if you like where this is going and satisfied with most of the changes, i will create a pull request for further review.
diff --git a/crates/libcgroups/src/v1/memory.rs b/crates/libcgroups/src/v1/memory.rs
index 98fa6b1e..38b0dd07 100644
--- a/crates/libcgroups/src/v1/memory.rs
+++ b/crates/libcgroups/src/v1/memory.rs
@@ -332,7 +332,7 @@ impl Memory {
Err(e) => {
// we need to look into the raw OS error for an EBUSY status
match e.inner().raw_os_error() {
- Some(code) => match Errno::from_i32(code) {
+ Some(code) => match Errno::from_raw(code) {
Errno::EBUSY => {
let usage = Self::get_memory_usage(cgroup_root)?;
let max_usage = Self::get_memory_max_usage(cgroup_root)?;
diff --git a/crates/libcontainer/src/container/builder_impl.rs b/crates/libcontainer/src/container/builder_impl.rs
index 38f2f1cc..59d7a062 100644
--- a/crates/libcontainer/src/container/builder_impl.rs
+++ b/crates/libcontainer/src/container/builder_impl.rs
@@ -132,7 +132,7 @@ impl ContainerBuilderImpl {
prctl::set_dumpable(false).map_err(|e| {
LibcontainerError::Other(format!(
"error in setting dumpable to false : {}",
- nix::errno::from_i32(e)
+ nix::errno::Errno::from_raw(e)
))
})?;
}
@@ -141,7 +141,7 @@ impl ContainerBuilderImpl {
// therefore we will have to move all the variable by value. Since self
// is a shared reference, we have to clone these variables here.
let container_args = ContainerArgs {
- container_type: self.container_type,
+ container_type: self.container_type.clone(),
syscall: self.syscall,
spec: Rc::clone(&self.spec),
rootfs: self.rootfs.to_owned(),
diff --git a/crates/libcontainer/src/container/tenant_builder.rs b/crates/libcontainer/src/container/tenant_builder.rs
index 0d805a0d..2c2dbc88 100644
--- a/crates/libcontainer/src/container/tenant_builder.rs
+++ b/crates/libcontainer/src/container/tenant_builder.rs
@@ -1,6 +1,6 @@
use caps::Capability;
use nix::fcntl::OFlag;
-use nix::unistd::{self, close, pipe2, read, Pid};
+use nix::unistd::{self, pipe2, read, Pid};
use oci_spec::runtime::{
Capabilities as SpecCapabilities, Capability as SpecCapability, LinuxBuilder,
LinuxCapabilities, LinuxCapabilitiesBuilder, LinuxNamespace, LinuxNamespaceBuilder,
@@ -8,6 +8,7 @@ use oci_spec::runtime::{
};
use procfs::process::Namespace;
+use std::os::fd::AsRawFd;
use std::rc::Rc;
use std::{
collections::HashMap,
@@ -148,13 +149,13 @@ impl TenantContainerBuilder {
let mut notify_socket = NotifySocket::new(notify_path);
notify_socket.notify_container_start()?;
- close(write_end).map_err(LibcontainerError::OtherSyscall)?;
+ drop(builder_impl); // this will close write_end fd
let mut err_str_buf = Vec::new();
loop {
let mut buf = [0; 3];
- match read(read_end, &mut buf).map_err(LibcontainerError::OtherSyscall)? {
+ match read(read_end.as_raw_fd(), &mut buf).map_err(LibcontainerError::OtherSyscall)? {
0 => {
if err_str_buf.is_empty() {
return Ok(pid);
diff --git a/crates/libcontainer/src/process/args.rs b/crates/libcontainer/src/process/args.rs
index ad37c07b..11c5d7f2 100644
--- a/crates/libcontainer/src/process/args.rs
+++ b/crates/libcontainer/src/process/args.rs
@@ -1,5 +1,6 @@
use libcgroups::common::CgroupConfig;
use oci_spec::runtime::Spec;
+use std::os::fd::OwnedFd;
use std::os::unix::prelude::RawFd;
use std::path::PathBuf;
use std::rc::Rc;
@@ -9,10 +10,21 @@ use crate::notify_socket::NotifyListener;
use crate::syscall::syscall::SyscallType;
use crate::user_ns::UserNamespaceConfig;
use crate::workload::Executor;
-#[derive(Debug, Copy, Clone)]
+#[derive(Debug)]
pub enum ContainerType {
InitContainer,
- TenantContainer { exec_notify_fd: RawFd },
+ TenantContainer { exec_notify_fd: OwnedFd },
+}
+
+impl Clone for ContainerType {
+ fn clone(&self) -> Self {
+ match self {
+ Self::InitContainer => Self::InitContainer,
+ Self::TenantContainer { exec_notify_fd } => Self::TenantContainer {
+ exec_notify_fd: exec_notify_fd.try_clone().unwrap(/* is unwrap ok? */),
+ },
+ }
+ }
}
#[derive(Clone)]
diff --git a/crates/libcontainer/src/process/container_intermediate_process.rs b/crates/libcontainer/src/process/container_intermediate_process.rs
index 0085416b..7e5ac9c0 100644
--- a/crates/libcontainer/src/process/container_intermediate_process.rs
+++ b/crates/libcontainer/src/process/container_intermediate_process.rs
@@ -1,3 +1,5 @@
+use std::os::fd::AsRawFd;
+
use crate::error::MissingSpecError;
use crate::{namespaces::Namespaces, process::channel, process::fork};
use libcgroups::common::CgroupManager;
@@ -128,12 +130,12 @@ pub fn container_intermediate_process(
if let Err(err) = main_sender.exec_failed(e.to_string()) {
tracing::error!(?err, "failed sending error to main sender");
}
- if let ContainerType::TenantContainer { exec_notify_fd } = args.container_type {
+ if let ContainerType::TenantContainer { exec_notify_fd } = &args.container_type {
let buf = format!("{e}");
- if let Err(err) = write(exec_notify_fd, buf.as_bytes()) {
+ if let Err(err) = write(&exec_notify_fd, buf.as_bytes()) {
tracing::error!(?err, "failed to write to exec notify fd");
}
- if let Err(err) = close(exec_notify_fd) {
+ if let Err(err) = close(exec_notify_fd.as_raw_fd()) {
tracing::error!(?err, "failed to close exec notify fd");
}
}
@@ -157,8 +159,8 @@ pub fn container_intermediate_process(
})?;
// Close the exec_notify_fd in this process
- if let ContainerType::TenantContainer { exec_notify_fd } = args.container_type {
- close(exec_notify_fd).map_err(|err| {
+ if let ContainerType::TenantContainer { exec_notify_fd } = &args.container_type {
+ close(exec_notify_fd.as_raw_fd()).map_err(|err| {
tracing::error!("failed to close exec notify fd: {}", err);
IntermediateProcessError::ExecNotify(err)
})?;
@@ -206,7 +208,7 @@ fn setup_userns(
prctl::set_dumpable(true).map_err(|e| {
IntermediateProcessError::Other(format!(
"error in setting dumpable to true : {}",
- nix::errno::from_i32(e)
+ nix::errno::Errno::from_raw(e)
))
})?;
sender.identifier_mapping_request().map_err(|err| {
@@ -220,7 +222,7 @@ fn setup_userns(
prctl::set_dumpable(false).map_err(|e| {
IntermediateProcessError::Other(format!(
"error in setting dumplable to false : {}",
- nix::errno::from_i32(e)
+ nix::errno::Errno::from_raw(e)
))
})?;
Ok(())
diff --git a/crates/libcontainer/src/process/fork.rs b/crates/libcontainer/src/process/fork.rs
index c0e05539..60a33f46 100644
--- a/crates/libcontainer/src/process/fork.rs
+++ b/crates/libcontainer/src/process/fork.rs
@@ -1,4 +1,4 @@
-use std::{ffi::c_int, fs::File, num::NonZeroUsize};
+use std::{ffi::c_int, num::NonZeroUsize};
use libc::SIGCHLD;
use nix::{
@@ -164,15 +164,17 @@ fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, Clone
// do not use MAP_GROWSDOWN since it is not well supported.
// Ref: https://man7.org/linux/man-pages/man2/mmap.2.html
let child_stack = unsafe {
- // Since nix = "0.27.1", `mmap()` requires a generic type `F: AsFd`.
- // `::<File>` doesn't have any meaning because we won't use it.
- mman::mmap::<File>(
+ // Since nix = "0.27.1", `mmap()` requires a generic type `F: AsFd` and takes f: Option<F>.
+ // if f: None, then fd = -1 is used; (-1 is just an invalid fd)
+ // let fd = f.map(|f| f.as_fd().as_raw_fd()).unwrap_or(-1);
+ // Since nix = "0.28.0" mmap function accepts f: F instead of Option<F>
+ // but it is not possible to manually create OwnedFd::from_raw_fd(-1) because rust will panic due to assert
+ // So, mmap_anonymous is used which achieves the exact same behaviour
+ mman::mmap_anonymous(
None,
NonZeroUsize::new(default_stack_size).ok_or(CloneError::ZeroStackSize)?,
mman::ProtFlags::PROT_READ | mman::ProtFlags::PROT_WRITE,
mman::MapFlags::MAP_PRIVATE | mman::MapFlags::MAP_ANONYMOUS | mman::MapFlags::MAP_STACK,
- None,
- 0,
)
.map_err(CloneError::StackAllocation)?
};
@@ -187,7 +189,7 @@ fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, Clone
// Since the child stack for clone grows downward, we need to pass in
// the top of the stack address.
- let child_stack_top = unsafe { child_stack.add(default_stack_size) };
+ let child_stack_top = unsafe { child_stack.as_ptr().add(default_stack_size) };
// Combine the clone flags with exit signals.
let combined_flags = (flags | exit_signal.unwrap_or(0)) as c_int;
diff --git a/crates/libcontainer/src/seccomp/mod.rs b/crates/libcontainer/src/seccomp/mod.rs
index 2bfce4cb..a1acb131 100644
--- a/crates/libcontainer/src/seccomp/mod.rs
+++ b/crates/libcontainer/src/seccomp/mod.rs
@@ -332,7 +332,7 @@ mod tests {
}
if let Some(errno) = ret.err() {
- if errno != nix::errno::from_i32(expect_error) {
+ if errno != nix::errno::Errno::from_raw(expect_error) {
Err(TestCallbackError::Custom(format!(
"getcwd failed but we didn't get the expected error from seccomp profile: {}",
errno
diff --git a/crates/libcontainer/src/syscall/linux.rs b/crates/libcontainer/src/syscall/linux.rs
index d01c4a54..35e5430c 100644
--- a/crates/libcontainer/src/syscall/linux.rs
+++ b/crates/libcontainer/src/syscall/linux.rs
@@ -314,7 +314,7 @@ impl Syscall for LinuxSyscall {
fn set_id(&self, uid: Uid, gid: Gid) -> Result<()> {
prctl::set_keep_capabilities(true).map_err(|errno| {
tracing::error!(?errno, "failed to set keep capabilities to true");
- nix::errno::from_i32(errno)
+ nix::errno::Errno::from_raw(errno)
})?;
// args : real *id, effective *id, saved set *id respectively
@@ -350,7 +350,7 @@ impl Syscall for LinuxSyscall {
}
prctl::set_keep_capabilities(false).map_err(|errno| {
tracing::error!(?errno, "failed to set keep capabilities to false");
- nix::errno::from_i32(errno)
+ nix::errno::Errno::from_raw(errno)
})?;
Ok(())
}
diff --git a/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs b/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
index 10664b2e..60337c4d 100644
--- a/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
+++ b/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
@@ -1,7 +1,7 @@
use anyhow::{bail, Context, Result};
use libcontainer::container::ContainerProcessState;
use nix::{
- sys::socket::{self, UnixAddr},
+ sys::socket::{self, Backlog, UnixAddr},
unistd,
};
use std::{
@@ -35,7 +35,11 @@ pub fn recv_seccomp_listener(seccomp_listener: &Path) -> SeccompAgentResult {
socket::bind(socket.as_raw_fd(), &addr).context("failed to bind to seccomp listener socket")?;
// Force the backlog to be 1 so in the case of an error, only one connection
// from clients will be waiting.
- socket::listen(&socket.as_fd(), 1).context("failed to listen on seccomp listener")?;
+ socket::listen(
+ &socket.as_fd(),
+ Backlog::new(1).unwrap( /* safe to unwrap because Backlog(1) is always valid */ ),
+ )
+ .context("failed to listen on seccomp listener")?;
let conn = match socket::accept(socket.as_raw_fd()) {
Ok(conn) => conn,
Err(e) => {
diff --git a/tests/contest/runtimetest/src/tests.rs b/tests/contest/runtimetest/src/tests.rs
index dee3afc7..d1d2f2ce 100644
--- a/tests/contest/runtimetest/src/tests.rs
+++ b/tests/contest/runtimetest/src/tests.rs
@@ -34,7 +34,7 @@ pub fn validate_readonly_paths(spec: &Spec) {
// change manual matching of i32 to e.kind() and match statement
for path in ro_paths {
if let std::io::Result::Err(e) = test_read_access(path) {
- let errno = Errno::from_i32(e.raw_os_error().unwrap());
+ let errno = Errno::from_raw(e.raw_os_error().unwrap());
// In the integration tests we test for both existing and non-existing readonly paths
// to be specified in the spec, so we allow ENOENT here
if errno == Errno::ENOENT {
@@ -50,7 +50,7 @@ pub fn validate_readonly_paths(spec: &Spec) {
}
if let std::io::Result::Err(e) = test_write_access(path) {
- let errno = Errno::from_i32(e.raw_os_error().unwrap());
+ let errno = Errno::from_raw(e.raw_os_error().unwrap());
// In the integration tests we test for both existing and non-existing readonly paths
// being specified in the spec, so we allow ENOENT, and we expect EROFS as the paths
// should be read-only
Hey @zahash , I'd suggest you wait till #2728 is merged, as that changes some of the things mentioned here, eg the mmap_anonymous. Also instead of using patch files, please either submit a PR, so link to a branch in your repo, as that would be easier for everyone to check and see.
@YJDoc2 oh, didn't realize there was already a pr to update nix. Fine then i will wait till it gets merged and then continue my work on the "Fully support OwnedFd" part
also, @utam0k can i also be assigned to this issue?
@oirom , do you still plan to work on this? Otherwise we'd probably re-assign this to @zahash
@zahash, have you considered the possibility of a FD leak here?
container_main_process.rs
pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bool)> {
...
let cb: CloneCb = {
let container_args = container_args.clone();
...
Box::new(move || {
...
})
};
...
let intermediate_pid = fork::container_clone(cb)
fork.rs
fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
...
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
unsafe { Box::from_raw(data as *mut CloneCb)() }
}
let ret = unsafe {
libc::clone(
main,
child_stack_top,
combined_flags,
data as *mut libc::c_void,
)
};
...
It seems the child process doesn't have a chance to drop the OwnedFd outside of CloneCb. :thinking: (In the past, this wasn't a problem because they were all the same RawFd.)
@oirom , do you still plan to work on this? Otherwise we'd probably re-assign this to @zahash
Now this issue is too difficult to me, so please re-assign to @zahash
Now this issue is too difficult to me, so please re-assign to @zahash
No worries, feel free to take a look at some other issues such as #361 if you still want to contribute!
@zahash , assigning to you :+1:
@zahash, have you considered the possibility of a FD leak here?
container_main_process.rs
pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bool)> { ... let cb: CloneCb = { let container_args = container_args.clone(); ... Box::new(move || { ... }) }; ... let intermediate_pid = fork::container_clone(cb)
fork.rs
fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> { ... extern "C" fn main(data: *mut libc::c_void) -> libc::c_int { unsafe { Box::from_raw(data as *mut CloneCb)() } } let ret = unsafe { libc::clone( main, child_stack_top, combined_flags, data as *mut libc::c_void, ) }; ...
It seems the child process doesn't have a chance to drop the OwnedFd outside of CloneCb. 🤔 (In the past, this wasn't a problem because they were all the same RawFd.)
@anti-entropy123 what about this? in the below line?
unsafe { drop(Box::from_raw(data)) };
I think it seems only the parent process can reach this point. Additionally, this is still dropping CloneCb rather than releasing data outside of the closure.
@anti-entropy123
i don't think i fully understand what you are saying. can you please explain what you mean by "data outside of the closure"
can you also please explain a bit what the clone
and clone3
functions are doing? its a bit hard to understand 😅
@YJDoc2 is it feasible to use OwnedFd
in Sender
and Receiver
? or is it too big of a change?
pub struct Receiver<T> {
receiver: OwnedFd,
phantom: PhantomData<T>,
}
pub struct Sender<T> {
sender: OwnedFd,
phantom: PhantomData<T>,
}
but OwnedFd
cannot be cloned, especially when it comes to creating a CloneCb
in container_main_process
and container_intermediate_process
. So, what should be the strategy here?
@zahash Here's my understanding of execution steps, but it might not be accurate...
CloneCb
, container_args
along with its internal OwnedFd
are cloned, and then moved into the closure.fn get_current_fds_nums() -> usize {
let proc_fd_dir = format!("/proc/{}/fd", process::id());
fs::read_dir(proc_fd_dir).unwrap().collect::<Vec<_>>().len()
}
Even though clone3() or clone() might be executed based on the system environment, the result should be the same.
@anti-entropy123
i tried to avoid cloning the ContainerArgs
in the CloneCb
and it complains that the callback cb
might outlive the &args
(or atleast thats what rust thinks).
let cb: CloneCb = {
Box::new(move || {
// ...
})
};
but aren't we are waiting for the child process to finish so that it won't outlive the callback? is there a "idiomatic" way to do this so rust can understand?
let pid = fork::container_clone_sibling(cb).map_err(|err| {
tracing::error!("failed to fork init process: {}", err);
IntermediateProcessError::InitProcess(err)
})?;
// ...
main_sender.intermediate_ready(pid).map_err(|err| {
tracing::error!("failed to wait on intermediate process: {}", err);
err
})?;
i can try adding a lifetime parameter to CloneCb
pub type CloneCb<'a> = Box<dyn FnMut() -> i32 + 'a>;
Also, cloning a OwnedFd only fails on wasm.
impl OwnedFd {
/// Creates a new `OwnedFd` instance that shares the same underlying file
/// description as the existing `OwnedFd` instance.
#[stable(feature = "io_safety", since = "1.63.0")]
pub fn try_clone(&self) -> crate::io::Result<Self> {
self.as_fd().try_clone_to_owned()
}
}
impl BorrowedFd<'_> {
/// Creates a new `OwnedFd` instance that shares the same underlying file
/// description as the existing `BorrowedFd` instance.
#[cfg(not(any(target_arch = "wasm32", target_os = "hermit")))]
#[stable(feature = "io_safety", since = "1.63.0")]
pub fn try_clone_to_owned(&self) -> crate::io::Result<OwnedFd> {
// We want to atomically duplicate this file descriptor and set the
// CLOEXEC flag, and currently that's done via F_DUPFD_CLOEXEC. This
// is a POSIX flag that was added to Linux in 2.6.24.
#[cfg(not(target_os = "espidf"))]
let cmd = libc::F_DUPFD_CLOEXEC;
// For ESP-IDF, F_DUPFD is used instead, because the CLOEXEC semantics
// will never be supported, as this is a bare metal framework with
// no capabilities for multi-process execution. While F_DUPFD is also
// not supported yet, it might be (currently it returns ENOSYS).
#[cfg(target_os = "espidf")]
let cmd = libc::F_DUPFD;
// Avoid using file descriptors below 3 as they are used for stdio
let fd = cvt(unsafe { libc::fcntl(self.as_raw_fd(), cmd, 3) })?;
Ok(unsafe { OwnedFd::from_raw_fd(fd) })
}
/// Creates a new `OwnedFd` instance that shares the same underlying file
/// description as the existing `BorrowedFd` instance.
#[cfg(any(target_arch = "wasm32", target_os = "hermit"))]
#[stable(feature = "io_safety", since = "1.63.0")]
pub fn try_clone_to_owned(&self) -> crate::io::Result<OwnedFd> {
Err(crate::io::const_io_error!(
crate::io::ErrorKind::Unsupported,
"operation not supported on WASI yet",
))
}
}
is there a "idiomatic" way to do this so rust can understand?
I think the main issue is that clone causes a "stack switch", which makes the child process lose all the local variables originally on the stack. So, at the moment, I don't have a good idea to solve this problem... (Apart from using Rc and manually adjusting the reference count.) @zahash
I think the main issue is that clone causes a "stack switch", which makes the child process lose all the local variables originally on the stack.
@anti-entropy123
i thought clone creates a process identical to the parent with all the local variables. but they both will have separate memory spaces so modification done on the child won't affect parent. besides, i don't think we are doing any modifications. (i might be wrong).
also, the CloneCb closure is boxed and we send the actual function pointer of the closure to libc. so shouldn't the closure retain all the references to the enclosed variables in the function body.
@anti-entropy123 i did this and all the tests pass. I basically removed #[derive(Clone)]
from Sender
s and Receiver
s and added a lifetime to CloneCb
@YJDoc2 i can't create a PR right now because this hasn't merged yet https://github.com/containers/youki/pull/2728 and a lot of my changes depend on it. sorry for using a patch again.
diff --git a/crates/libcontainer/src/channel.rs b/crates/libcontainer/src/channel.rs
index 23d255f9..427912d6 100644
--- a/crates/libcontainer/src/channel.rs
+++ b/crates/libcontainer/src/channel.rs
@@ -18,13 +18,11 @@ pub enum ChannelError {
#[error("channel connection broken")]
BrokenChannel,
}
-#[derive(Clone)]
pub struct Receiver<T> {
receiver: RawFd,
phantom: PhantomData<T>,
}
-#[derive(Clone)]
pub struct Sender<T> {
sender: RawFd,
phantom: PhantomData<T>,
diff --git a/crates/libcontainer/src/process/channel.rs b/crates/libcontainer/src/process/channel.rs
index 9683e5ed..b1bd2e88 100644
--- a/crates/libcontainer/src/process/channel.rs
+++ b/crates/libcontainer/src/process/channel.rs
@@ -40,7 +40,6 @@ pub fn main_channel() -> Result<(MainSender, MainReceiver), ChannelError> {
Ok((MainSender { sender }, MainReceiver { receiver }))
}
-#[derive(Clone)]
pub struct MainSender {
sender: Sender<Message>,
}
@@ -88,7 +87,6 @@ impl MainSender {
}
}
-#[derive(Clone)]
pub struct MainReceiver {
receiver: Receiver<Message>,
}
@@ -199,7 +197,6 @@ pub fn intermediate_channel() -> Result<(IntermediateSender, IntermediateReceive
))
}
-#[derive(Clone)]
pub struct IntermediateSender {
sender: Sender<Message>,
}
@@ -219,7 +216,6 @@ impl IntermediateSender {
}
}
-#[derive(Clone)]
pub struct IntermediateReceiver {
receiver: Receiver<Message>,
}
@@ -256,7 +252,6 @@ pub fn init_channel() -> Result<(InitSender, InitReceiver), ChannelError> {
Ok((InitSender { sender }, InitReceiver { receiver }))
}
-#[derive(Clone)]
pub struct InitSender {
sender: Sender<Message>,
}
@@ -275,7 +270,6 @@ impl InitSender {
}
}
-#[derive(Clone)]
pub struct InitReceiver {
receiver: Receiver<Message>,
}
diff --git a/crates/libcontainer/src/process/container_intermediate_process.rs b/crates/libcontainer/src/process/container_intermediate_process.rs
index 0085416b..8d9a015c 100644
--- a/crates/libcontainer/src/process/container_intermediate_process.rs
+++ b/crates/libcontainer/src/process/container_intermediate_process.rs
@@ -99,12 +99,7 @@ pub fn container_intermediate_process(
}
let cb: CloneCb = {
- let args = args.clone();
- let init_sender = init_sender.clone();
- let inter_sender = inter_sender.clone();
- let mut main_sender = main_sender.clone();
- let mut init_receiver = init_receiver.clone();
- Box::new(move || {
+ Box::new(|| {
if let Err(ret) = prctl::set_name("youki:[2:INIT]") {
tracing::error!(?ret, "failed to set name for child process");
return ret;
@@ -121,7 +116,7 @@ pub fn container_intermediate_process(
tracing::error!(?err, "failed to close sender in the intermediate process");
return -1;
}
- match container_init_process(&args, &mut main_sender, &mut init_receiver) {
+ match container_init_process(&args, main_sender, init_receiver) {
Ok(_) => 0,
Err(e) => {
tracing::error!("failed to initialize container process: {e}");
diff --git a/crates/libcontainer/src/process/container_main_process.rs b/crates/libcontainer/src/process/container_main_process.rs
index f7066053..41fa96ab 100644
--- a/crates/libcontainer/src/process/container_main_process.rs
+++ b/crates/libcontainer/src/process/container_main_process.rs
@@ -42,16 +42,12 @@ pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bo
// cloned process, we have to be deligent about closing any unused channel.
// At minimum, we have to close down any unused senders. The corresponding
// receivers will be cleaned up once the senders are closed down.
- let (main_sender, mut main_receiver) = channel::main_channel()?;
- let inter_chan = channel::intermediate_channel()?;
- let init_chan = channel::init_channel()?;
+ let (mut main_sender, mut main_receiver) = channel::main_channel()?;
+ let mut inter_chan = channel::intermediate_channel()?;
+ let mut init_chan = channel::init_channel()?;
let cb: CloneCb = {
- let container_args = container_args.clone();
- let mut main_sender = main_sender.clone();
- let mut inter_chan = inter_chan.clone();
- let mut init_chan = init_chan.clone();
- Box::new(move || {
+ Box::new(|| {
if let Err(ret) = prctl::set_name("youki:[1:INTER]") {
tracing::error!(?ret, "failed to set name for child process");
return ret;
diff --git a/crates/libcontainer/src/process/fork.rs b/crates/libcontainer/src/process/fork.rs
index c0e05539..2a69d8d2 100644
--- a/crates/libcontainer/src/process/fork.rs
+++ b/crates/libcontainer/src/process/fork.rs
@@ -33,7 +33,7 @@ pub enum CloneError {
/// pass in a child stack, which is empty. By storing the closure in heap, we
/// can then in the new process to re-box the heap memory back to a closure
/// correctly.
-pub type CloneCb = Box<dyn FnMut() -> i32>;
+pub type CloneCb<'a> = Box<dyn FnMut() -> i32 + 'a>;
// Clone a sibling process that shares the same parent as the calling
// process. This is used to launch the container init process so the parent
the lifetime in CloneCb<_>
is only there to satisfy rust. i think the actual reason this is working is because we actually wait for the child process to finish.
let pid = fork::container_clone_sibling(cb).map_err(|err| {
tracing::error!("failed to fork init process: {}", err);
IntermediateProcessError::InitProcess(err)
})?;
// ...
main_sender.intermediate_ready(pid).map_err(|err| {
tracing::error!("failed to wait on intermediate process: {}", err);
err
})?;
without this, the child might outlive the parent and become zombie. and rust lifetimes won't save us.
maybe there is a better way to do this so that rust understands the "waiting" part is important for the lifetime.
OK, I agree with what you're saying, and since your code doesn't clone the container_args
variable, there should be no copying of file descriptors. Therefore, fd leaks shouldn't occur. :+1:
However, I'd still like to ask if you're aware of whether you're actually calling clone3()
or clone()
?
// An internal wrapper to manage the clone3 vs clone fallback logic.
fn clone_internal(
mut cb: CloneCb,
flags: u64,
exit_signal: Option<u64>,
) -> Result<Pid, CloneError> {
match clone3(&mut cb, flags, exit_signal) {
...
// For now, we decide to only fallback on ENOSYS
Err(CloneError::Clone(nix::Error::ENOSYS)) => {
...
let pid = clone(cb, flags, exit_signal)?;
Since in clone(), a new stack is allocated for the child process,
fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
...
let ret = unsafe {
libc::clone(
main,
child_stack_top,
combined_flags,
data as *mut libc::c_void,
)
};
...
wouldn't the closure's references to local variables (e.g., &container_args) become invalid? @zahash
@anti-entropy123
I have the exact same question 😅😅
do you know how clone3 vs clone differences are? and since the closure has references to variables in the paren'ts memory, do they also get copied to the child memory?
i can see that clone3 also takes args
but clone doesn't. (but i don't think we are passing anything related in it)
but why do the tests pass? do they even check this part of the code?
I think your test device should support clone3(), so it likely called clone3() in reality. And clone3() seems to completely copy the parent process's memory, which is why your test passed.
@anti-entropy123 is there a way to force it to use clone instead?
never mind. i'll just comment out the clone3 part
But I'm not entirely sure why clone() necessitates switching to a new stack; it could be a requirement of the syscall itself (? :thinking: ).
@anti-entropy123 all the tests still pass
fn clone_internal(
mut cb: CloneCb,
flags: u64,
exit_signal: Option<u64>,
) -> Result<Pid, CloneError> {
clone(cb, flags, exit_signal)
}
all the tests still pass
Amazing... I'm curious about the reason. @zahash
@anti-entropy123
Amazing... I'm curious about the reason
I'm equally curious. 😂
maybe this is undefined behavior.
maybe this is undefined behavior
My expectation was that a segmentation fault would occur.
@anti-entropy123 if found this on the kernel man page
The clone() wrapper function
When the child process is created with the clone() wrapper
function, it commences execution by calling the function pointed
to by the argument fn. (This differs from [fork(2)](https://man7.org/linux/man-pages/man2/fork.2.html), where
execution continues in the child from the point of the [fork(2)](https://man7.org/linux/man-pages/man2/fork.2.html)
call.) The arg argument is passed as the argument of the
function fn.
When the fn(arg) function returns, the child process terminates.
The integer returned by fn is the exit status for the child
process. The child process may also terminate explicitly by
calling [exit(2)](https://man7.org/linux/man-pages/man2/exit.2.html) or after receiving a fatal signal.
The stack argument specifies the location of the stack used by
the child process. Since the child and calling process may share
memory, it is not possible for the child process to execute in
the same stack as the calling process. The calling process must
therefore set up memory space for the child stack and pass a
pointer to this space to clone(). Stacks grow downward on all
processors that run Linux (except the HP PA processors), so stack
usually points to the topmost address of the memory space set up
for the child stack. Note that clone() does not provide a means
whereby the caller can inform the kernel of the size of the stack
area.
what about others? what do they think? @utam0k @YJDoc2 @yihuaf
I think I understand now. Even though clone() assigns a new stack to the child process, the original memory region (the old stack) may still be valid. So, those references remain effective. Do you think so? @zahash
Hey, will need some time to read all the discussion above and consider it. Probably would get back on this around the weekend....
@anti-entropy123
maybe. but how can one process directly access the memory of another process? its not allowed right?
or maybe its allowed in "parent - child" processes?
or maybe its because, the child gets an exact copy of the parent's memory when clone/fork
ref: issue #2317, PR #2369
Introducation Since nix version 0.27.1, some APIs (e.g.,
nix::pty::openpty()
,nix::sys::socket::socket()
) have started usingOwnedFd
to manage FDs. However, it is not straightforward to passOwnedFd
between forks. So in PR #2369, we still useRawFd
in the subsequent steps.This issue try to do the change from
RawFd
toOwnedFd
.