About a year ago, I started playing around with the PacketProtector (www.packetprotector.org/) load on a Linksys WRTSL54GS. I had intended it to be a replacement for my existing firewall, but it never got past acceptance testing as it crashed out after a while. After a lull and a VLAN misconfiguration that locked me out from access to the LAN ports, I decided to look into the error.
To recover the unit, I needed to see what at what point it was failing and try to regain control of it. Since PacketProtector is derived from OpenWRT, I perused their forums for answers. I purchased a USB TTL controller (http://www.amazon.com/gp/product/B006JKNWLE/ref=oh_details_o01_s00_i00), as recommended by the OpenWRT recovery article (http://wiki.openwrt.org/doc/hardware/port.serial) and the hardware specific article (http://wiki.openwrt.org/toh/linksys/wrtsl54gs?s[]=wrtsl54gs). When allowed to run for a while, this error shows up on the console, ad nauseum:
__alloc_pages: 0-order allocation failed (gfp=0x20/0)
It seemed to be indicative of either a USB OHCI error, or a memory error from Google searches. The following article had the most relevant data (https://packetprotector.org/forum/viewtopic.php?id=4362). Other Openwrt forum posts indicate that usb ohci failures are possible. However, most of the posts pointed to a memory leak that consumes everything. To test this, I set the unit to run for a while with periodic process and 'free' assessments to watch how the memory flows out. If it was leaking, I wanted to determine which service it is. The analysis script follows:
# cat /packetprotector/memanal.sh
#!/bin/sh
while [[ 1 == 1 ]]; do
echo;
echo "####################";
ps | sort | uniq;
date;
free;
sleep 60;
done
When run from the TTL console at /dev/ttyUSB0 under minicom with console capture engaged, it provided a usable log from which to determine how long expected runtime was, as well as what the process is hogging up the memory. Here was the last log before failure:
Sat Jan 1 00:54:36 UTC 2000
total used free shared buffers
Mem: 30512 28284 2228 0 316
Swap: 65524 38804 26720
Total: 96036 67088 28948
####################
1 root 136 S init
2 root SW [keventd]
3 root RWN [ksoftirqd_CPU0]
4 root DW [kswapd]
5 root SW [bdflush]
6 root SW [kupdated]
9 root SW [mtdblockd]
54 root SWN [jffs2_gcd_mtd4]
67 root 136 S logger -s -p 6 -t
69 root 152 S /bin/ash --login
78 root 124 S /sbin/syslogd -C16 -m 0
81 root 112 S /sbin/klogd
118 root 92 S /sbin/hotplug2 --override --persistent --max-children
222 root 168 S udhcpc -t 0 -i eth1 -b -p /var/run/eth1.pid -R
255 root SW [khubd]
376 root SW [usb-storage-0]
377 root SW [scsi_eh_0]
641 root 244 S crond -c /etc/crontabs
646 root 96 S /usr/sbin/dropbear -p 22
663 nobody 108 S /usr/sbin/dnsmasq --dhcp-range=lan,192.168.1.100,192.
713 root 100 S mini_httpd -C /etc/mini_httpd/mini_httpd.conf
731 root 160 S /packetprotector/usr/sbin/snort-inline -Q -D -c /pack
786 root 208 S /packetprotector/usr/sbin/openvpn /packetprotector/et
799 nobody 124 S /packetprotector/usr/sbin/tinyproxy -c /packetprotect
800 nobody 100 S /packetprotector/usr/sbin/tinyproxy -c /packetprotect
801 nobody 100 S /packetprotector/usr/sbin/tinyproxy -c /packetprotect
802 nobody 100 S /packetprotector/usr/sbin/tinyproxy -c /packetprotect
803 nobody 100 S /packetprotector/usr/sbin/tinyproxy -c /packetprotect
804 nobody 100 S /packetprotector/usr/sbin/tinyproxy -c /packetprotect
805 nobody 100 S /packetprotector/usr/sbin/tinyproxy -c /packetprotect
806 nobody 100 S /packetprotector/usr/sbin/tinyproxy -c /packetprotect
807 nobody 100 S /packetprotector/usr/sbin/tinyproxy -c /packetprotect
808 root 120 S /packetprotector/usr/sbin/dansguardian -c /packetprot
809 root 116 S /packetprotector/usr/sbin/dansguardian -c /packetprot
810 root 116 S /packetprotector/usr/sbin/dansguardian -c /packetprot
811 root 116 S /packetprotector/usr/sbin/dansguardian -c /packetprot
812 root 120 S /packetprotector/usr/sbin/dansguardian -c /packetprot
813 root 120 S /packetprotector/usr/sbin/dansguardian -c /packetprot
814 root 120 S /packetprotector/usr/sbin/dansguardian -c /packetprot
816 root 120 S /packetprotector/usr/sbin/dansguardian -c /packetprot
817 root 120 S /packetprotector/usr/sbin/dansguardian -c /packetprot
818 root 120 S /packetprotector/usr/sbin/dansguardian -c /packetprot
819 root 120 S /packetprotector/usr/sbin/dansguardian -c /packetprot
836 root 120 S /bin/sh /packetprotector/usr/sbin/snort_check
839 root 120 S /bin/sh /packetprotector/usr/sbin/snort_check
840 root 248 R top -n1
841 root 112 S grep snort
842 root 112 S grep 731
843 root 136 S /bin/sh /packetprotector/usr/sbin/snort_check
844 root 124 S /bin/sh /packetprotector/usr/sbin/snort_check
848 root 120 S /bin/sh /packetprotector/usr/sbin/snort_check
851 root 120 S /bin/sh /packetprotector/usr/sbin/snort_check
852 root 260 R top -n1
853 root 120 S grep snort
854 root 112 S grep 731
855 root 136 S /bin/sh /packetprotector/usr/sbin/snort_check
856 root 124 S /bin/sh /packetprotector/usr/sbin/snort_check
860 root 120 S /bin/sh /packetprotector/usr/sbin/snort_check
863 root 120 S /bin/sh /packetprotector/usr/sbin/snort_check
[...] x 1798 times
1852 root 196 D top -n1
1853 root 92 S grep snort
1854 root 80 S grep 731
1855 root 132 S /bin/sh /packetprotector/usr/sbin/snort_check
1856 root 120 S /bin/sh /packetprotector/usr/sbin/snort_check
1861 root 116 S /bin/sh /packetprotector/usr/sbin/snort_check
1863 root 116 S /bin/sh /packetprotector/usr/sbin/snort_check
1864 root 164 D top -n1
1865 root 104 S grep snort
1866 root 80 S grep 731
1867 root 132 S /bin/sh /packetprotector/usr/sbin/snort_check
1868 root 120 S /bin/sh /packetprotector/usr/sbin/snort_check
1870 root 104 S /bin/sh -c /packetprotector/usr/sbin/proxylogrotate.s
1873 root 116 S /bin/sh /packetprotector/usr/sbin/snort_check
1875 root 124 D /bin/sh /packetprotector/usr/sbin/proxylogrotate.sh
1877 root 104 S /bin/sh -c /packetprotector/usr/sbin/watchdog > /dev/
1878 root 112 D /bin/sh /packetprotector/usr/sbin/watchdog
1879 root /bin/sh /packetprotector/usr/sbin/watchdog
1880 root 116 D /bin/sh /packetprotector/usr/sbin/snort_check
The failure happened at 55 minutes after boot. This was interesting, as the swap was being consumed by prior runs at about one megabyte per minute. Something caused it to die before it hit the end of swap, possibly due to overallocation of processing resources. However, the memory leak could clearly be traced back to the many 'snort_check' scripts that are not dying as expected. A copy of that script follows:
$ cat /packetprotector/usr/sbin/snort_check
#!/bin/sh
#
# Changed snort_check 12/22/09 - CMH714
#
#
# Set High that snort shouldnt go over
# Default should be something very high like 140
# The memory snort eats is relative to the rules enabled.
HI_PCT="130"
echo "hi pct - " $HI_PCT
pid=`pidof snort-inline`
echo "pid - " "$pid"
if [ -n "$pid" ] ; then
SNORTPCT=`top -n1 | grep snort | grep $pid | cut -d% -f1 | awk '{print $NF}'`
fi
echo "snortpct - " "$SNORTPCT"
SNORT_INLINE=`grep "snort-inline=" /etc/packetprotector.conf | cut -d "=" -f 2`
if [ -n "$SNORTPCT" ] && [ -n "$pid" ] ; then
echo "Snorts running - Checking its Memory Usage"
if [ "$SNORTPCT" -gt "$HI_PCT" ] && [ -n "$pid" ] ; then
EPOCHTIME=`date +%s`
LOGDIR="/packetprotector/logs"
mv $LOGDIR/snort-inline/alert $LOGDIR/snort-inline/alert.$EPOCHTIME && touch $LOGDIR/snort-inline/alert
# Create a log file, assumes directory exists already
echo "snort needs to restart - killing snort" > /packetprotector/logs/snort-restart.log
echo `date` >> /packetprotector/logs/snort-restart.log
# killall snort-inline 2> /dev/null
killall -1 snort-inline 2> /dev/null
# restart snort - dont need to do this as watchdog will get it next time around
# /packetprotector/etc/init.d/S11snort-inline
fi
elif ([ -n "$SNORTPCT" ] && [ -z "$pid" ]) || ([ -z "$SNORTPCT" ] && [ -n "$pid" ]) ; then
echo "Snort may be going up or down. Will check again next minute..."
elif [ -z "$SNORTPCT" ] && [ -z "$pid" ] && [ "X$SNORT_INLINE" = "X1" ] ; then
echo "restarting snort"
echo "restarting snort" >> /packetprotector/logs/snort-restart.log
echo `date` >> /packetprotector/logs/snort-restart.log
/packetprotector/etc/init.d/S11snort-inline
else
return
fi
After review, it became clear that the script wouldn't reach conclusion and close on its own. Consider the following line:
SNORTPCT=`top -n1 | grep snort | grep $pid | cut -d% -f1 | awk '{print $NF}'`
The TOP command was called without the '-b' option to put it into batch mode. This kept the TOP process in memory, constantly updating and consuming memory as a running process, keeping the script and its following pipes alive until memory is exhausted.
Patching the line to add the batch flag seemed to have slowed the stacking up of the scripts, but did not eliminate the problem. The modification merely dragged the failure out for a few hours more:
# diff snort_check snort_check.mod
16c16
< SNORTPCT=`top -n1 | grep snort | grep $pid | cut -d% -f1 | awk '{print $NF}'`
---
> SNORTPCT=`top -b -n1 | grep snort | grep $pid | cut -d% -f1 | awk '{print $NF}'`
Further investigation running the command manually revealed that TOP did not exit as it should from the count mode (-n 1), holding and updating the process as it ran forward. The following memory status at failure mirrored the one before.
Sat Jan 1 01:03:37 UTC 2000
total used free shared buffers
Mem: 30512 29868 644 0 288
Swap: 65524 33356 32168
Total: 96036 63224 32812
top proc: 49
snort procs: 252
TOP not exiting caused over-consumption and, ultimately, a process contention problem. TOP was called within the /packetprotector/usr/sbin/snort_check script, which itself was executed every minute from the /packetprotector/usr/sbin/watchdog script as scheduled by CRON. To alleviate the problem immediately, the watchdog was modified to remove the snort_check script, as below.
# cat /packetprotector/usr/sbin/watchdog
#!/bin/sh
/packetprotector/usr/sbin/nas_check &
#/packetprotector/usr/sbin/snort_check &
After seven hours of running with no traffic, the memory utilization stabilized and held constant at 32MB. This indicated a need to modify the snort_check script to improve the memory monitoring process. Since TOP was not performing as documented per the options, another method was needed to get the memory information. /proc/[pid]/status has the VmRSS information that is reported to TOP, so this was used to draw the information directly for the script.
/proc/[pid]/status - look for VmRSS for the in memory value
/proc/[pid]/cmdline - look for the command line executed for a grep
The following modifications were made:
# diff snort_check snort_check.mod
10c10,12
< HI_PCT="130"
---
> # 20120805 - Changing PCT to an absolute value for a limit. Adjust as needed. Removes the need to use TOP.
> #HI_PCT="130"
> HI_PCT="20000"
16c18
< SNORTPCT=`top -n1 | grep snort | grep $pid | cut -d% -f1 | awk '{print $NF}'`
---
> SNORTPCT=`cat /proc/$pid/status | grep VmRSS | awk '{print $2}'`
/packetprotector/usr/sbin/watchdog was modified to call the new script.
# cat usr/sbin/watchdog
#!/bin/sh
/packetprotector/usr/sbin/nas_check &
#/packetprotector/usr/sbin/snort_check &
/packetprotector/usr/sbin/snort_check.mod &
The new script was run under monitoring for a while to ensure the new watchdog process was working. Consumed memory was now unaffected and the script is confirmed to be working properly. The change was made permanent to my unit and has eliminated crashing for more than a week.
2 comments:
Thanks for that blog post. Great troubleshooting process. Makes me want to load PP back on my router.
Very nice deconstruction and troubleshooting. Thanks for the posting.
Post a Comment