webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

Warcio does not support replay of sites hosted on NCSA 1.5 #141

Open omgoo opened 2 years ago

omgoo commented 2 years ago

Here is an interesting one for you, Ilya.

The original NCSA 1.5 web server responds with the status line "HTTP 200 Document follows" rather than one beginning with HTTP/1.0.

In recordloader.py, HTTP_TYPES only looks for 'HTTP/1.0' and 'HTTP/1.1'.

Modifying HTTP_TYPES to look for 'HTTP/1.0', 'HTTP/1.1', 'HTTP' does allow the requested web page to replay. I'd add this as a PR, but I doubt this is the best idea.
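To make the effect of that change concrete, here is a minimal sketch of prefix-based status-line matching. It is not warcio's actual code: `split_status_line`, `STRICT` and `RELAXED` are hypothetical names used only for illustration, and the sketch assumes the parser tries each allowed prefix in order against the start of the line.

```python
def split_status_line(status_line, prefixes):
    """Return (protocol, rest) if the line starts with an allowed prefix, else None.

    Hypothetical helper for illustration; prefixes are tried in the order given.
    """
    upper = status_line.upper()
    for prefix in prefixes:
        if upper.startswith(prefix):
            return status_line[:len(prefix)], status_line[len(prefix):].strip()
    return None


STRICT = ("HTTP/1.0", "HTTP/1.1")   # the two values named in the issue
RELAXED = STRICT + ("HTTP",)        # the proposed change: also accept a bare "HTTP"

ncsa_line = "HTTP 200 Document follows"

print(split_status_line(ncsa_line, STRICT))           # → None
print(split_status_line(ncsa_line, RELAXED))          # → ('HTTP', '200 Document follows')
print(split_status_line("HTTP/1.0 200 OK", RELAXED))  # → ('HTTP/1.0', '200 OK')
```

One reason to doubt the bare 'HTTP' prefix, at least in this sketch: it accepts anything that starts with those four letters, so a line such as "HTTP/0.9 200 OK" would be split into 'HTTP' and '/0.9 200 OK' instead of being rejected.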

Here is the header from the ARC file in question:

http://www.open.gov.uk:80/ofsted/nursery/rp511200.htm 193.32.28.8 19970616061332 text/html 30594
HTTP 200 Document follows
Date: Mon, 16 Jun 1997 07:09:23 GMT
Server: NCSA/1.5.1
Last-modified: Fri, 09 May 1997 20:24:52 GMT
Content-type: text/html
Content-length: 30414
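As a sanity check on the record above, the following sketch parses the ARC header line and compares the record length with the HTTP header block plus the declared Content-length. It assumes the five-field ARC version 1 layout (URL, IP address, archive date, content type, length) and CRLF line endings in the stored headers; neither is stated in this thread. `ArcHeader` and `parse_arc_header` are hypothetical names.

```python
from collections import namedtuple

# Assumed ARC v1 record header layout: URL, IP, archive date, content type, length.
ArcHeader = namedtuple("ArcHeader", "url ip archive_date content_type length")


def parse_arc_header(line):
    """Split a space-separated ARC v1 header line into its five fields."""
    url, ip, archive_date, content_type, length = line.split(" ")
    return ArcHeader(url, ip, archive_date, content_type, int(length))


ARC_LINE = ("http://www.open.gov.uk:80/ofsted/nursery/rp511200.htm "
            "193.32.28.8 19970616061332 text/html 30594")

HTTP_HEADER_LINES = [
    "HTTP 200 Document follows",
    "Date: Mon, 16 Jun 1997 07:09:23 GMT",
    "Server: NCSA/1.5.1",
    "Last-modified: Fri, 09 May 1997 20:24:52 GMT",
    "Content-type: text/html",
    "Content-length: 30414",
]
CONTENT_LENGTH = 30414

header = parse_arc_header(ARC_LINE)

# Size of the header block if each line ends in CRLF, plus the blank separator line.
header_bytes = sum(len(line) + 2 for line in HTTP_HEADER_LINES) + 2

print(header.length - CONTENT_LENGTH, header_bytes)  # → 180 180
```

Under those assumptions the lengths add up, which is consistent with the record itself being intact and only the status line being unusual.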

This is the URL in question, but you'll only see a 500 error:

https://webarchive.nationalarchives.gov.uk//ukgwa/19970616061332/http://www.open.gov.uk:80/ofsted/nursery/rp511200.htm

I'll share the ARC file with you if I can get permission to release it.

omgoo commented 2 years ago

On further investigation, I'm not sure whether this is an NCSA issue or an issue with the 1997 IA ARCs. I can't find a version of NCSA 1.5 to test my theory.

wsdookadr commented 2 years ago

> I can't find a version of NCSA 1.5 to test my theory.

That's okay, I found a copy of the old 1995 source code of NCSA 1.5 here, and an old copy of the docs sits here. Then I wrote a patch for it. Not that much had to change.

Here is the patch:

patch1.txt ``` diff --git a/BUGS b/BUGS index c173b2d..4a43b40 100644 --- a/BUGS +++ b/BUGS @@ -12,7 +12,7 @@ Known Bugs in 1.5.1 *) Relative urls in imagemaps are broken *) doesn't kill cgi scripts if the user aborts *) content_length gets reset after scanning cgi headers, instead of before -*) can core dump on special case in getline (rfc822 line wrapping) +*) can core dump on special case in getline1 (rfc822 line wrapping) Known Bugs in 1.5.1b3 --------------------- diff --git a/CHANGES b/CHANGES index 826a6fe..f544900 100644 --- a/CHANGES +++ b/CHANGES @@ -12,7 +12,7 @@ Changes for 1.5.2a Changes for 1.5.2 ------------------ -*) Changed getline rfc822 line wrap to check for validity of the next bits +*) Changed getline1 rfc822 line wrap to check for validity of the next bits before attempting to see them *) Changed imagemap.c so relative URLs actually work *) Don't core dump on a method only request @@ -58,7 +58,7 @@ Changes for 1.5.1 *) Why do we require full URLs in Redirect? A local (root) url should work fine *) Redirect from .htaccess should work now (completely) *) Added hack to allow SSI of CGI, at a great expense of speed (CGI_SSI_HACK) -*) Made getline() code re-entrant (now has its own sock_buf struct) +*) Made getline1() code re-entrant (now has its own sock_buf struct) @@ -219,7 +219,7 @@ Fixes for 1.5 Beta 3 *) Now log start command to error_log *) Improved usage function (for -v command line) *) Made sigjmp_buf default define for JMP_BUF (missed from 1.4.2) -*) Fixed getline() +*) Fixed getline1() diff --git a/cgi-src/change-passwd.c b/cgi-src/change-passwd.c index fe1dd5a..4976046 100644 --- a/cgi-src/change-passwd.c +++ b/cgi-src/change-passwd.c @@ -151,7 +151,7 @@ main(int argc, char *argv[]) { } found = 0; - while(!(getline(line,256,f))) { + while(!(getline1(line,256,f))) { if(found || (line[0] == '#') || (!line[0])) { putline(tfp,line); continue; diff --git a/cgi-src/imagemap.c b/cgi-src/imagemap.c index 9f99c72..5c42429 100644 --- 
a/cgi-src/imagemap.c +++ b/cgi-src/imagemap.c @@ -41,7 +41,7 @@ ** 03/07/95: Carlos Varela, cvarela@ncsa.uiuc.edu ** ** 1.8 : Fixed bug (strcat->sprintf) when reporting error. -** Included getline() function from util.c in NCSA httpd distribution. +** Included getline1() function from util.c in NCSA httpd distribution. ** ** 11/08/95: Brandon Long, blong@ncsa.uiuc.edu ** @@ -124,7 +124,7 @@ int main(int argc, char **argv) goto openconf; } - while(!(getline(input,MAXLINE,fp))) { + while(!(getline1(input,MAXLINE,fp))) { char confname[MAXLINE]; if((input[0] == '#') || (!input[0])) continue; @@ -163,7 +163,7 @@ int main(int argc, char **argv) servererr(errstr); } - while(!(getline(input,MAXLINE,fp))) { + while(!(getline1(input,MAXLINE,fp))) { char type[MAXLINE]; char url[MAXLINE]; char num[10]; @@ -377,7 +377,7 @@ int isname(char c) return (!isspace(c)); } -int getline(char *s, int n, FILE *f) { +int getline1(char *s, int n, FILE *f) { register int i=0; while(1) { diff --git a/cgi-src/util.c b/cgi-src/util.c index c3d5d65..b01a9f6 100644 --- a/cgi-src/util.c +++ b/cgi-src/util.c @@ -95,7 +95,7 @@ int rind(char *s, char c) { return -1; } -int getline(char *s, int n, FILE *f) { +int getline1(char *s, int n, FILE *f) { register int i=0; while(1) { diff --git a/cgi-src/util.h b/cgi-src/util.h index eded336..432bd42 100644 --- a/cgi-src/util.h +++ b/cgi-src/util.h @@ -6,7 +6,7 @@ char x2c(char *what); void unescape_url(char *url); void plustospace(char *str); int rind(char *s, char c); -int getline(char *s, int n, FILE *f); +int getline1(char *s, int n, FILE *f); void send_fd(FILE *f, FILE *fd); int ind(char *s, char c); void escape_shell_cmd(char *cmd); diff --git a/conf/httpd.conf-dist b/conf/httpd.conf-dist index dc96e72..f0d3b8f 100644 --- a/conf/httpd.conf-dist +++ b/conf/httpd.conf-dist @@ -27,7 +27,7 @@ ServerType standalone # need HTTPd to be run as root initially. 
# Default: 80 (or DEFAULT_PORT) -Port 80 +Port 8412 # StartServers: The number of servers to launch at startup. Must be # compiled without the NO_PASS compile option @@ -66,8 +66,8 @@ TimeOut 1200 # User/Group: The name (or #number) of the user/group to run HTTPd as. # Default: #-1 (or DEFAULT_USER / DEFAULT_GROUP) -User nobody -Group #-1 +User ncsa +Group ncsa # IdentityCheck: Enables or disables RFC931 compliant logging of the # remote user name for sites which run identd or something similar. @@ -97,7 +97,7 @@ Group #-1 # Default: If you do not specify a ServerName, HTTPd attempts to retrieve # it through system calls. -#ServerName new.host.name +ServerName localhost # ServerAdmin: Your address, where problems with the server should be # e-mailed. @@ -262,8 +262,8 @@ DNSMode Standard # VirtualHost as Optional or Required. -DocumentRoot /local -ServerName localhost.ncsa.uiuc.edu +DocumentRoot /usr/local/etc/httpd/htdocs +ServerName localhost ResourceConfig conf/localhost_srm.conf diff --git a/conf/srm.conf-dist b/conf/srm.conf-dist index 1d713fb..9cec3d2 100644 --- a/conf/srm.conf-dist +++ b/conf/srm.conf-dist @@ -49,12 +49,12 @@ ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-bin/ # FCGIScritpAlias: Same as ScriptAlias, except for FCGI scripts # Format: FCGIScriptAlias fakename realname -FCGIScriptAlias /fcgi-bin/ /usr/local/etc/httpd/fcgi-devel-kit/examples/ +# FCGIScriptAlias /fcgi-bin/ /usr/local/etc/httpd/fcgi-devel-kit/examples/ # Define the AppClasses. 
These get hit when requests come in for # /fcgi-bin/tiny-fcgi.fcgi or /fcgi-bin/tiny-fcgi2.fcgi -AppClass /usr/local/etc/httpd/fcgi-devel-kit/examples/tiny-fcgi.fcgi -listen-queue-depth 10 -processes 2 -AppClass /usr/local/etc/httpd/fcgi-devel-kit/examples/tiny-fcgi2.fcgi -listen-queue-depth 10 -processes 2 +#AppClass /usr/local/etc/httpd/fcgi-devel-kit/examples/tiny-fcgi.fcgi -listen-queue-depth 10 -processes 2 +#AppClass /usr/local/etc/httpd/fcgi-devel-kit/examples/tiny-fcgi2.fcgi -listen-queue-depth 10 -processes 2 #=========================================================================== # Directory Indexing @@ -151,6 +151,7 @@ DefaultType text/plain #AddType text/x-imagemap .map #AddType application/x-httpd-cgi .cgi #AddType application/x-httpd-fcgi .fcgi +#AddType application/x-httpd-cgi .cgi #=========================================================================== # Misc Server Resources diff --git a/src/CHANGES b/src/CHANGES index 3432366..ccd3c35 100644 --- a/src/CHANGES +++ b/src/CHANGES @@ -13,7 +13,7 @@ Fixes for 1.5.2 ------------------ -*) Changed getline rfc822 line wrap to check for validity of the next bits +*) Changed getline1 rfc822 line wrap to check for validity of the next bits before attempting to see them *) Changed imagemap.c so relative URLs actually work *) Don't core dump on a method only request diff --git a/src/HTTPd_REQ_PATH b/src/HTTPd_REQ_PATH index d081916..b494d34 100644 --- a/src/HTTPd_REQ_PATH +++ b/src/HTTPd_REQ_PATH @@ -13,13 +13,13 @@ child_main httpd.c free RequestMain http_request.c signal - getline + getline1 setproctitle decode_request http_request.c strtok MapMethod http_request.c get_mime_headers - getline + getline1 strchr isspace strcasecmp @@ -73,7 +73,7 @@ child_main httpd.c stat FOpen fdwrap.c parse_access_dir http_config.c - cfg_getline util.c + cfg_getline1 util.c access_syntax_error http_config.c cfg_getword util.c add_type http_mime.c @@ -151,11 +151,11 @@ child_main httpd.c add_cgi_vars error_log2stderr 
execle/execve - getline util.c + getline1 util.c write read scan_script_header cgi.c - getline util.c + getline1 util.c strdup realloc waitpid @@ -167,7 +167,7 @@ child_main httpd.c dump_default_header http_mime.c send_script cgi.c alarm - getline + getline1 write read kill_children cgi.c diff --git a/src/Makefile b/src/Makefile index 381b9b9..f2aac41 100644 --- a/src/Makefile +++ b/src/Makefile @@ -60,7 +60,7 @@ KRB5_CFLAGS = -DKRB5 -I$(KRB5_DIR)/include -I$(KRB5_DIR)/include/krb5 # # To enable DBM password/groupfile support, define the DBM_SUPPORT flag -DBM_CFLAGS = -DDBM_SUPPORT +DBM_CFLAGS = "-lgdbm -lgdbm_compat -lcrypt" #DBM_LIBS = -lndbm #DBM_LIBS = -ldbm #DBM_LIBS = -lgdbm @@ -187,11 +187,11 @@ hp-cc: make tar AUX_CFLAGS=-DHPUX CC=cc CFLAGS="-O -Aa" DBM_LIBS=-lndbm linux: - make tar AUX_CFLAGS=-DLINUX CC=gcc CFLAGS=-O2 DBM_LIBS=-lgdbm + make tar AUX_CFLAGS=-DLINUX CC=gcc CFLAGS=-O2 DBM_LIBS="-lgdbm -lcrypt" linux2: linux linux1: - make tar AUX_CFLAGS="-DLINUX -DFD_LINUX" CC=gcc CFLAGS=-O2 DBM_LIBS=-lgdbm + make tar AUX_CFLAGS="-DLINUX -DFD_LINUX" CC=gcc CFLAGS=-O2 DBM_LIBS="-lgdbm -lcrypt" netbsd: make tar AUX_CFLAGS=-DNETBSD EXTRA_LIBS=-lcrypt CC=cc CFLAGS=-O2 diff --git a/src/cgi.c b/src/cgi.c index 4a13a13..369504d 100644 --- a/src/cgi.c +++ b/src/cgi.c @@ -267,7 +267,7 @@ int scan_cgi_header(per_request *reqInfo, int pd) /* ADC put in the G_SINGLE_CHAR option, so that CGI SSI's would work. * it was: - * if((ret = getline(reqInfo->cgi_buf,str,HUGE_STRING_LEN-1,0,timeout)) <= 0) + * if((ret = getline1(reqInfo->cgi_buf,str,HUGE_STRING_LEN-1,0,timeout)) <= 0) * * This should be cleaned up perhaps so that it only does this if SSI's are * allowed for this script directory. 
ZZZZ @@ -278,7 +278,7 @@ int scan_cgi_header(per_request *reqInfo, int pd) #endif /* CGI_SSI_HACK */ while(1) { - if((ret = getline(reqInfo->cgi_buf,str,HUGE_STRING_LEN-1,options,timeout)) <= 0) + if((ret = getline1(reqInfo->cgi_buf,str,HUGE_STRING_LEN-1,options,timeout)) <= 0) { char error_msg[MAX_STRING_LEN]; Close(pd); @@ -508,7 +508,7 @@ int cgi_stub(per_request *reqInfo, struct stat *finfo, int allow_options) int nDone = 0; signal(SIGPIPE,SIG_IGN); - nBytes=getline(reqInfo->sb, szBuf,HUGE_STRING_LEN,G_FLUSH, timeout); + nBytes=getline1(reqInfo->sb, szBuf,HUGE_STRING_LEN,G_FLUSH, timeout); nTotalBytes = nBytes; if (nBytes >= 0) { if (nBytes > 0) write(p2[1], szBuf, nBytes); @@ -538,10 +538,10 @@ int cgi_stub(per_request *reqInfo, struct stat *finfo, int allow_options) } /* Previously, this was broken because we read the results of the CGI using - * getline, but the SSI parser used buffered stdio. + * getline1, but the SSI parser used buffered stdio. * * ADC changed scan_cgi_header so that it uses G_SINGLE_CHAR when it - * calls getline. Yes, this means pitiful performance for CGI scripts. + * calls getline1. Yes, this means pitiful performance for CGI scripts. */ /* Fine, parse the output of CGI scripts. Talk about useless * overhead. . . 
@@ -620,7 +620,7 @@ long send_fd(per_request *reqInfo, int pd, void (*onexit)(void)) alarm(timeout); if (reqInfo->cgi_buf != NULL) - n=getline(reqInfo->cgi_buf, buf,IOBUFSIZE,G_FLUSH,timeout); + n=getline1(reqInfo->cgi_buf, buf,IOBUFSIZE,G_FLUSH,timeout); else n = 0; while (1) { diff --git a/src/digest.c b/src/digest.c index ba7b0e9..76678cd 100644 --- a/src/digest.c +++ b/src/digest.c @@ -63,7 +63,7 @@ int get_digest(per_request *reqInfo, char *user, char *realm, char *digest, reqInfo->auth_digestfile); die(reqInfo,SC_SERVER_ERROR,errstr); } - while(!(cfg_getline(l,MAX_STRING_LEN,f))) { + while(!(cfg_getline1(l,MAX_STRING_LEN,f))) { if((l[0] == '#') || (!l[0])) continue; getword(w,l,':'); getword(r,l,':'); diff --git a/src/fcgi.c b/src/fcgi.c index be836ba..40ae76b 100644 --- a/src/fcgi.c +++ b/src/fcgi.c @@ -2310,7 +2310,7 @@ static int FastCgiDoWork(WS_Request *reqPtr, FastCgiInfo *infoPtr) if (nFirst) { char szBuf[IOBUFSIZE]; - nBytes=getline(reqPtr->sb, szBuf,IOBUFSIZE,G_FLUSH,0); + nBytes=getline1(reqPtr->sb, szBuf,IOBUFSIZE,G_FLUSH,0); BufferAddData(infoPtr->reqInbufPtr, szBuf, nBytes); if (nBytes > 0) { BufferAddData(infoPtr->reqInbufPtr, szBuf, nBytes); diff --git a/src/fdwrap.c b/src/fdwrap.c index dd33a81..fe2e68e 100644 --- a/src/fdwrap.c +++ b/src/fdwrap.c @@ -20,8 +20,8 @@ * */ -#include "config.h" #include "portability.h" +#include "config.h" #include #ifndef NO_STDLIB_H diff --git a/src/http_access.c b/src/http_access.c index f7a4827..5e2668e 100644 --- a/src/http_access.c +++ b/src/http_access.c @@ -180,11 +180,8 @@ int find_host_deny(per_request *reqInfo, int x) return FA_ALLOW; } -/* match_referer() - * currently matches restriction with sent for only as long as restricted - */ -int match_referer(char *restrict, char *sent) { - return !(strcmp_match(sent,restrict)); +int match_referer(char *restrict_, char *sent) { + return !(strcmp_match(sent,restrict_)); } /* find_referer_allow() diff --git a/src/http_auth.c b/src/http_auth.c index 
5139dd5..3f90656 100644 --- a/src/http_auth.c +++ b/src/http_auth.c @@ -140,7 +140,7 @@ int get_pw(per_request *reqInfo, char *user, char *pw, security_data* sec) if (reqInfo->auth_pwfile_type == AUTHFILETYPE_STANDARD) { /* From Conrad Damon (damon@netserver.standford.edu), - Don't start cfg_getline loop if auth_pwfile is a directory. */ + Don't start cfg_getline1 loop if auth_pwfile is a directory. */ if ((stat (reqInfo->auth_pwfile, &finfo) == -1) || (!S_ISREG(finfo.st_mode))) { @@ -152,7 +152,7 @@ int get_pw(per_request *reqInfo, char *user, char *pw, security_data* sec) sprintf(errstr,"Could not open user file %s",reqInfo->auth_pwfile); die(reqInfo,SC_SERVER_ERROR,errstr); } - while(!(cfg_getline(l,MAX_STRING_LEN,f))) { + while(!(cfg_getline1(l,MAX_STRING_LEN,f))) { if((l[0] == '#') || (!l[0])) continue; getword(w,l,':'); diff --git a/src/http_config.c b/src/http_config.c index 54aee66..b21751a 100644 --- a/src/http_config.c +++ b/src/http_config.c @@ -186,7 +186,7 @@ void process_server_config(per_host *host, FILE *cfg, FILE *errors, if (!virtual) n=0; /* Parse server config file. Remind me to learn yacc. 
*/ - while(!(cfg_getline(l,MAX_STRING_LEN,cfg))) { + while(!(cfg_getline1(l,MAX_STRING_LEN,cfg))) { ++n; if((l[0] != '#') && (l[0] != '\0')) { cfg_getword(w,l); @@ -541,7 +541,7 @@ void process_resource_config(per_host *host, FILE *open, FILE *errors, else return; } } else cfg = open; - while(!(cfg_getline(l,MAX_STRING_LEN,cfg))) { + while(!(cfg_getline1(l,MAX_STRING_LEN,cfg))) { ++n; if((l[0] != '#') && (l[0] != '\0')) { cfg_getword(w,l); @@ -862,7 +862,7 @@ int parse_access_dir(per_request *reqInfo, FILE *f, int line, char or, sec[x].on_deny[i] = NULL; } - while(!(cfg_getline(l,MAX_STRING_LEN,f))) { + while(!(cfg_getline1(l,MAX_STRING_LEN,f))) { ++n; if((l[0] == '#') || (!l[0])) continue; cfg_getword(w,l); @@ -1198,7 +1198,7 @@ int parse_access_dir(per_request *reqInfo, FILE *f, int line, char or, else if(!strcasecmp(w,"DELETE")) m[M_DELETE]=1; } while(1) { - if(cfg_getline(l,MAX_STRING_LEN,f)) + if(cfg_getline1(l,MAX_STRING_LEN,f)) access_syntax_error(reqInfo,n,"Limit missing /Limit",f,file); n++; if((l[0] == '#') || (!l[0])) continue; @@ -1393,7 +1393,7 @@ void process_access_config(FILE *errors) perror("fopen"); exit(1); } - while(!(cfg_getline(l,MAX_STRING_LEN,f))) { + while(!(cfg_getline1(l,MAX_STRING_LEN,f))) { ++n; if((l[0] == '#') || (!l[0])) continue; cfg_getword(w,l); diff --git a/src/http_mime.c b/src/http_mime.c index 0d89048..f1f0151 100644 --- a/src/http_mime.c +++ b/src/http_mime.c @@ -146,7 +146,7 @@ void init_mime(void) forced_types = NULL; encoding_types = NULL; - while(!(cfg_getline(l,MAX_STRING_LEN,f))) { + while(!(cfg_getline1(l,MAX_STRING_LEN,f))) { if(l[0] == '#') continue; cfg_getword(w,l); if(!(ct = (char *)malloc(sizeof(char) * (strlen(w) + 1)))) diff --git a/src/http_request.c b/src/http_request.c index 57e6808..9973839 100644 --- a/src/http_request.c +++ b/src/http_request.c @@ -484,7 +484,7 @@ void get_http_headers(per_request *reqInfo) char *field_val; int options = 0; - while(getline(reqInfo->sb,field_type,HUGE_STRING_LEN-1,options, 
+ while(getline1(reqInfo->sb,field_type,HUGE_STRING_LEN-1,options, timeout) != -1) { if(!field_type[0]) @@ -612,7 +612,7 @@ void RequestMain(per_request *reqInfo) sockbuf_count++; } - if (getline(reqInfo->sb, as_requested, HUGE_STRING_LEN, + if (getline1(reqInfo->sb, as_requested, HUGE_STRING_LEN, options, timeout) == -1) return; diff --git a/src/portability.h b/src/portability.h index 7f4fc9f..62a9fdb 100644 --- a/src/portability.h +++ b/src/portability.h @@ -20,6 +20,7 @@ #ifndef _PORTABILITY_H_ #define _PORTABILITY_H_ + /* Define one of these according to your system. */ #if defined(SUNOS4) #define BSD @@ -30,6 +31,7 @@ char *crypt(char *pw, char *salt); #define DIR_FILENO(p) ((p)->dd_fd) + #elif defined(SOLARIS2) #undef BSD #define NO_KILLPG @@ -210,7 +212,7 @@ typedef int mode_t; #endif /* Needed for newer versions of libc (5.2.x) to use FD_LINUX hack */ #define DIRENT_ILLEGAL_ACCESS -#define DIR_FILENO(p) ((p)->dd_fd) +#define DIR_FILENO(p) (dirfd(p)) #define CMSG_DATA(cmptr) ((cmptr)->cmsg_data) #define NEED_SYS_UN_H #undef BSD diff --git a/src/rfc822.c b/src/rfc822.c index ad13ad0..02309a8 100644 --- a/src/rfc822.c +++ b/src/rfc822.c @@ -3,8 +3,8 @@ 30-Aug-94 ekr */ -/*A wrapper around getline to do rfc822 line unfolding*/ -int ht_rfc822_getline(char *s,int n,int f,unsigned int timeout) +/*A wrapper around getline1 to do rfc822 line unfolding*/ +int ht_rfc822_getline1(char *s,int n,int f,unsigned int timeout) { static char pb=0; int len; @@ -22,7 +22,7 @@ int ht_rfc822_getline(char *s,int n,int f,unsigned int timeout) return(0); } - while(!getline(s,n,f,timeout)){ + while(!getline1(s,n,f,timeout)){ len=strlen(s); s+=len; n-=len; diff --git a/src/util.c b/src/util.c index 5e81c52..750970f 100644 --- a/src/util.c +++ b/src/util.c @@ -545,7 +545,7 @@ void http2cgi(char* h, char *w) { w++; } -void getline_timed_out(int sig) +void getline1_timed_out(int sig) { char errstr[MAX_STRING_LEN]; @@ -582,7 +582,7 @@ sock_buf *new_sock_buf(per_request *reqInfo, int sd) * 
This routine is currently not thread safe. * This routine may be thread safe. (blong 3/13/96) */ -int getline(sock_buf *sb, char *s, int n, int options, unsigned int timeout) +int getline1(sock_buf *sb, char *s, int n, int options, unsigned int timeout) { char *endp = s + n - 1; int have_alarmed = 0; @@ -614,7 +614,7 @@ int getline(sock_buf *sb, char *s, int n, int options, unsigned int timeout) do { if (sb->buf_posn == sb->buf_good) { have_alarmed = 1; - signal(SIGALRM,getline_timed_out); + signal(SIGALRM,getline1_timed_out); alarm(timeout); ret=read(sb->sd, sb->buffer, size); @@ -738,7 +738,7 @@ int eat_ws (FILE* fp) return ch; } -int cfg_getline (char* s, int n, FILE* fp) +int cfg_getline1 (char* s, int n, FILE* fp) { int len = 0, ch; diff --git a/src/util.h b/src/util.h index 41f78e1..3c1079d 100644 --- a/src/util.h +++ b/src/util.h @@ -24,7 +24,7 @@ #include #include -/* getline options */ +/* getline1 options */ #define G_RESET_BUF 1 #define G_FLUSH 2 #define G_SINGLE_CHAR 4 @@ -49,10 +49,10 @@ void getparents(char *name); void no2slash(char *name); uid_t uname2id(char *name); gid_t gname2id(char *name); -int getline(sock_buf *sb, char *s, int n, int options, unsigned int timeout); +int getline1(sock_buf *sb, char *s, int n, int options, unsigned int timeout); sock_buf *new_sock_buf(per_request *reqInfo, int sd); int eat_ws (FILE* fp); -int cfg_getline(char *s, int n, FILE *f); +int cfg_getline1(char *s, int n, FILE *f); void getword(char *word, char *line, char stop); void splitURL(char *line, char *url, char *args); void cfg_getword(char *word, char *line); diff --git a/start.sh b/start.sh new file mode 100755 index 0000000..1ebbefa --- /dev/null +++ b/start.sh @@ -0,0 +1,5 @@ +#!/bin/bash +./httpd +while true; do + sleep 1; +done diff --git a/support/Makefile b/support/Makefile index 26c65eb..afeef87 100644 --- a/support/Makefile +++ b/support/Makefile @@ -49,7 +49,7 @@ hp-gcc: make all CC=gcc CFLAGS="-DHPUX" EXTRA_LIBS=-lndbm linux: - make all CC=gcc 
CFLAGS="-DLINUX" EXTRA_LIBS=-lgdbm + make all CC=gcc CFLAGS="-DLINUX" EXTRA_LIBS="-lcrypt -lgdbm -lgdbm_compat" netbsd: make all CC=cc CFLAGS="-DNETBSD" EXTRA_LIBS=-lcrypt diff --git a/support/dbmdigest.c b/support/dbmdigest.c index 75f22db..bd3473c 100644 --- a/support/dbmdigest.c +++ b/support/dbmdigest.c @@ -42,7 +42,7 @@ void getword(char *word, char *line, char stop) { while(line[y++] = line[x++]); } -int getline(char *s, int n, FILE *f) { +int getline1(char *s, int n, FILE *f) { register int i=0; while(1) { @@ -166,7 +166,7 @@ main(int argc, char *argv[]) { strcpy(user,argv[2]); found = 0; - while(!(getline(line,MAX_STRING_LEN,f))) { + while(!(getline1(line,MAX_STRING_LEN,f))) { if(found || (line[0] == '#') || (!line[0])) { putline(tfp,line); continue; diff --git a/support/htpasswd.c b/support/htpasswd.c index fb3415a..cedf37d 100644 --- a/support/htpasswd.c +++ b/support/htpasswd.c @@ -45,7 +45,7 @@ void getword(char *word, char *line, char stop) { while(line[y++] = line[x++]); } -int getline(char *s, int n, FILE *f) { +int getline1(char *s, int n, FILE *f) { register int i=0; while(1) { @@ -163,7 +163,7 @@ main(int argc, char *argv[]) { strcpy(user,argv[2]); found = 0; - while(!(getline(line,MAX_STRING_LEN,f))) { + while(!(getline1(line,MAX_STRING_LEN,f))) { if(found || (line[0] == '#') || (!line[0])) { putline(tfp,line); continue; diff --git a/support/webgrab.c b/support/webgrab.c index b254c49..53cc9fa 100644 --- a/support/webgrab.c +++ b/support/webgrab.c @@ -24,6 +24,7 @@ #include #include +#include #define VERSION "1.3" ```
Dockerfile
```
FROM debian:11
RUN apt-get -y update && apt-get -y install gcc make libgdbm-dev libgdbm-compat-dev procps curl
ADD ncsa-httpd /ncsa-httpd
RUN cd ncsa-httpd && make clean linux
RUN mkdir -p /usr/local/etc/httpd/htdocs
RUN mkdir -p /usr/local/etc/httpd/logs
RUN mkdir -p /usr/local/etc/httpd/conf
ADD ncsa-httpd/conf/httpd.conf-dist /usr/local/etc/httpd/conf/httpd.conf
ADD ncsa-httpd/conf/access.conf-dist /usr/local/etc/httpd/conf/access.conf
ADD ncsa-httpd/conf/localhost_srm.conf-dist /usr/local/etc/httpd/conf/localhost_srm.conf
ADD ncsa-httpd/conf/mime.types /usr/local/etc/httpd/conf/mime.types
ADD ncsa-httpd/conf/srm.conf-dist /usr/local/etc/httpd/conf/srm.conf
RUN useradd -ms /bin/bash ncsa
WORKDIR /ncsa-httpd
CMD ["./start.sh"]
```
Makefile
```
build:
	docker build -t ncsa-1.5 .

start:
	docker run -p 8412:8412 --rm --name "oldncsa" ncsa-1.5

stop:
	docker stop oldncsa

shell:
	docker exec -ti oldncsa bash
```

Now if I do something like this:

user@garage3:~/ncsa$ make start 
docker run -p 8412:8412 --rm --name "oldncsa" ncsa-1.5
NCSA HTTPd NCSA/1.5.2a
Licensed material.  Portions of this work are
Copyright (C) 1995-1996 Board of Trustees of the University of Illinois
Copyright (C) 1995-1996 The Apache Group
Copyright (C) 1989-1993 RSA Data Security, Inc.
Copyright (C) 1993-1994 Carnegie Mellon University
Copyright (C) 1991      Bell Communications Research, Inc. (Bellcore)
Copyright (C) 1994      Spyglass, Inc.

At that point it's ready to serve requests. I'm attaching a zip with the already-patched source and all of the files above. To build the Docker image, run make build before make start.
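To see exactly what status line the rebuilt server sends, it may be easiest to read it off the raw socket, since higher-level HTTP clients can reject a status line that lacks a version number. The sketch below assumes the container is listening on localhost port 8412, as in the Makefile above; `read_status_line` is a hypothetical helper, and no particular response is assumed.

```python
import socket


def read_status_line(host="localhost", port=8412, path="/", timeout=5.0):
    """Send a bare HTTP/1.0 GET and return the first line of the raw response."""
    request = f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        buf = b""
        # Read until the first newline arrives or the server closes the connection.
        while b"\n" not in buf:
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    # Keep only the first line and drop the trailing CR, if any.
    return buf.split(b"\n", 1)[0].rstrip(b"\r").decode("latin-1")
```

With the container started via make start, calling `print(read_status_line())` would show whether this build also answers with a version-less "HTTP 200" line.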

ncsa.zip

omgoo commented 1 year ago

I've added a PR that fixes the issue with replaying web archives that were created from servers running NCSA 1.5.1. I'm not convinced this is the best solution, but it does fix our issue and allows the archive content to replay: https://github.com/webrecorder/warcio/pull/153