Saturday, August 25, 2012

Fixing PacketProtector v3.8 WRT54GLS Memory Access Error

About a year ago, I started playing around with the PacketProtector (www.packetprotector.org/) load on a Linksys WRTSL54GS. I had intended it to be a replacement for my existing firewall, but it never got past acceptance testing as it crashed out after a while. After a lull and a VLAN misconfiguration that locked me out from access to the LAN ports, I decided to look into the error.

To recover the unit, I needed to see what at what point it was failing and try to regain control of it. Since PacketProtector is derived from OpenWRT, I perused their forums for answers. I purchased a USB TTL controller (http://www.amazon.com/gp/product/B006JKNWLE/ref=oh_details_o01_s00_i00), as recommended by the OpenWRT recovery article (http://wiki.openwrt.org/doc/hardware/port.serial) and the hardware specific article (http://wiki.openwrt.org/toh/linksys/wrtsl54gs?s[]=wrtsl54gs).  When allowed to run for a while, this error shows up on the console, ad nauseum:

__alloc_pages: 0-order allocation failed (gfp=0x20/0)

It seemed to be indicative of either a USB OHCI error, or a memory error from Google searches. The following article had the most relevant data (https://packetprotector.org/forum/viewtopic.php?id=4362). Other Openwrt forum posts indicate that usb ohci failures are possible. However, most of the posts pointed to a memory leak that consumes everything. To test this, I set the unit to run for a while with periodic process and 'free' assessments to watch how the memory flows out. If it was leaking, I wanted to determine which service it is. The analysis script follows:

  # cat /packetprotector/memanal.sh
  #!/bin/sh                                                                      
  while [[ 1 == 1 ]]; do                                                         
          echo;                                                                  
          echo "####################";                                           
          ps | sort | uniq;                                                      
          date;                                                                  
          free;                                                                  
          sleep 60;                                                              
  done       

When run from the TTL console at /dev/ttyUSB0 under minicom with console capture engaged, it provided a usable log from which to determine how long expected runtime was, as well as what the process is hogging up the memory. Here was the last log before failure:

  Sat Jan  1 00:54:36 UTC 2000
                total         used         free       shared      buffers
    Mem:        30512        28284         2228            0          316
   Swap:        65524        38804        26720
  Total:        96036        67088        28948
 
  ####################
      1 root        136 S   init      
      2 root            SW  [keventd]
      3 root            RWN [ksoftirqd_CPU0]
      4 root            DW  [kswapd]
      5 root            SW  [bdflush]
      6 root            SW  [kupdated]
      9 root            SW  [mtdblockd]
     54 root            SWN [jffs2_gcd_mtd4]
     67 root        136 S   logger -s -p 6 -t 
     69 root        152 S   /bin/ash --login
     78 root        124 S   /sbin/syslogd -C16 -m 0
     81 root        112 S   /sbin/klogd
    118 root         92 S   /sbin/hotplug2 --override --persistent --max-children
    222 root        168 S   udhcpc -t 0 -i eth1 -b -p /var/run/eth1.pid -R
    255 root            SW  [khubd]
    376 root            SW  [usb-storage-0]
    377 root            SW  [scsi_eh_0]
    641 root        244 S   crond -c /etc/crontabs
    646 root         96 S   /usr/sbin/dropbear -p 22
    663 nobody      108 S   /usr/sbin/dnsmasq --dhcp-range=lan,192.168.1.100,192.
    713 root        100 S   mini_httpd -C /etc/mini_httpd/mini_httpd.conf
    731 root        160 S   /packetprotector/usr/sbin/snort-inline -Q -D -c /pack
    786 root        208 S   /packetprotector/usr/sbin/openvpn /packetprotector/et
    799 nobody      124 S   /packetprotector/usr/sbin/tinyproxy -c /packetprotect
    800 nobody      100 S   /packetprotector/usr/sbin/tinyproxy -c /packetprotect
    801 nobody      100 S   /packetprotector/usr/sbin/tinyproxy -c /packetprotect
    802 nobody      100 S   /packetprotector/usr/sbin/tinyproxy -c /packetprotect
    803 nobody      100 S   /packetprotector/usr/sbin/tinyproxy -c /packetprotect
    804 nobody      100 S   /packetprotector/usr/sbin/tinyproxy -c /packetprotect
    805 nobody      100 S   /packetprotector/usr/sbin/tinyproxy -c /packetprotect
    806 nobody      100 S   /packetprotector/usr/sbin/tinyproxy -c /packetprotect
    807 nobody      100 S   /packetprotector/usr/sbin/tinyproxy -c /packetprotect
    808 root        120 S   /packetprotector/usr/sbin/dansguardian -c /packetprot
    809 root        116 S   /packetprotector/usr/sbin/dansguardian -c /packetprot
    810 root        116 S   /packetprotector/usr/sbin/dansguardian -c /packetprot
    811 root        116 S   /packetprotector/usr/sbin/dansguardian -c /packetprot
    812 root        120 S   /packetprotector/usr/sbin/dansguardian -c /packetprot
    813 root        120 S   /packetprotector/usr/sbin/dansguardian -c /packetprot
    814 root        120 S   /packetprotector/usr/sbin/dansguardian -c /packetprot
    816 root        120 S   /packetprotector/usr/sbin/dansguardian -c /packetprot
    817 root        120 S   /packetprotector/usr/sbin/dansguardian -c /packetprot
    818 root        120 S   /packetprotector/usr/sbin/dansguardian -c /packetprot
    819 root        120 S   /packetprotector/usr/sbin/dansguardian -c /packetprot
    836 root        120 S   /bin/sh /packetprotector/usr/sbin/snort_check
    839 root        120 S   /bin/sh /packetprotector/usr/sbin/snort_check
    840 root        248 R   top -n1
    841 root        112 S   grep snort
    842 root        112 S   grep 731
    843 root        136 S   /bin/sh /packetprotector/usr/sbin/snort_check
    844 root        124 S   /bin/sh /packetprotector/usr/sbin/snort_check
    848 root        120 S   /bin/sh /packetprotector/usr/sbin/snort_check
    851 root        120 S   /bin/sh /packetprotector/usr/sbin/snort_check
    852 root        260 R   top -n1
    853 root        120 S   grep snort
    854 root        112 S   grep 731
    855 root        136 S   /bin/sh /packetprotector/usr/sbin/snort_check
    856 root        124 S   /bin/sh /packetprotector/usr/sbin/snort_check
    860 root        120 S   /bin/sh /packetprotector/usr/sbin/snort_check
    863 root        120 S   /bin/sh /packetprotector/usr/sbin/snort_check
 
    [...] x 1798 times
    
   1852 root        196 D   top -n1
   1853 root         92 S   grep snort
   1854 root         80 S   grep 731
   1855 root        132 S   /bin/sh /packetprotector/usr/sbin/snort_check
   1856 root        120 S   /bin/sh /packetprotector/usr/sbin/snort_check
   1861 root        116 S   /bin/sh /packetprotector/usr/sbin/snort_check
   1863 root        116 S   /bin/sh /packetprotector/usr/sbin/snort_check
   1864 root        164 D   top -n1
   1865 root        104 S   grep snort
   1866 root         80 S   grep 731
   1867 root        132 S   /bin/sh /packetprotector/usr/sbin/snort_check
   1868 root        120 S   /bin/sh /packetprotector/usr/sbin/snort_check
   1870 root        104 S   /bin/sh -c /packetprotector/usr/sbin/proxylogrotate.s
   1873 root        116 S   /bin/sh /packetprotector/usr/sbin/snort_check
   1875 root        124 D   /bin/sh /packetprotector/usr/sbin/proxylogrotate.sh
   1877 root        104 S   /bin/sh -c /packetprotector/usr/sbin/watchdog > /dev/
   1878 root        112 D   /bin/sh /packetprotector/usr/sbin/watchdog
   1879 root             /bin/sh /packetprotector/usr/sbin/watchdog
   1880 root        116 D   /bin/sh /packetprotector/usr/sbin/snort_check

The failure happened at 55 minutes after boot. This was interesting, as the swap was being consumed by prior runs at about one megabyte per minute. Something caused it to die before it hit the end of swap, possibly due to overallocation of processing resources. However, the memory leak could clearly be traced back to the many 'snort_check' scripts that are not dying as expected. A copy of that script follows:

  $ cat /packetprotector/usr/sbin/snort_check
  #!/bin/sh
  #
  # Changed snort_check 12/22/09 - CMH714
  #
  #
 
  # Set High that snort shouldnt go over
  # Default should be something very high like 140
  # The memory snort eats is relative to the rules enabled.
  HI_PCT="130"
  echo "hi pct - " $HI_PCT
 
  pid=`pidof snort-inline`
  echo "pid - " "$pid"
  if [ -n "$pid" ] ; then
      SNORTPCT=`top -n1 | grep snort | grep $pid | cut -d% -f1 | awk '{print $NF}'`
  fi
  echo "snortpct - " "$SNORTPCT"
 
  SNORT_INLINE=`grep "snort-inline=" /etc/packetprotector.conf | cut -d "=" -f 2`
 
  if [ -n "$SNORTPCT" ] && [ -n "$pid" ] ; then
      echo "Snorts running - Checking its Memory Usage"
      if [ "$SNORTPCT" -gt "$HI_PCT" ] && [ -n "$pid" ] ; then
          EPOCHTIME=`date +%s`
          LOGDIR="/packetprotector/logs"
          mv $LOGDIR/snort-inline/alert $LOGDIR/snort-inline/alert.$EPOCHTIME && touch $LOGDIR/snort-inline/alert
          # Create a log file, assumes directory exists already
          echo "snort needs to restart - killing snort" > /packetprotector/logs/snort-restart.log
          echo `date` >> /packetprotector/logs/snort-restart.log
  #        killall snort-inline 2> /dev/null
          killall -1 snort-inline 2> /dev/null
  #        restart snort - dont need to do this as watchdog will get it next time around
  #        /packetprotector/etc/init.d/S11snort-inline
      fi
  elif ([ -n "$SNORTPCT" ] && [ -z "$pid" ]) || ([ -z "$SNORTPCT" ] && [ -n "$pid" ]) ; then
      echo "Snort may be going up or down. Will check again next minute..."
  elif [ -z "$SNORTPCT" ] && [ -z "$pid" ] && [ "X$SNORT_INLINE" = "X1" ] ; then
      echo "restarting snort"
      echo "restarting snort" >> /packetprotector/logs/snort-restart.log
      echo `date` >> /packetprotector/logs/snort-restart.log
      /packetprotector/etc/init.d/S11snort-inline
  else
      return
  fi

After review, it became clear that the script wouldn't reach conclusion and close on its own. Consider the following line:

  SNORTPCT=`top -n1 | grep snort | grep $pid | cut -d% -f1 | awk '{print $NF}'`

The TOP command was called without the '-b' option to put it into batch mode. This kept the TOP process in memory, constantly updating and consuming memory as a running process, keeping the script and its following pipes alive until memory is exhausted.

Patching the line to add the batch flag seemed to have slowed the stacking up of the scripts, but did not eliminate the problem. The modification merely dragged the failure out for a few hours more:

  # diff snort_check snort_check.mod
  16c16
  <     SNORTPCT=`top -n1 | grep snort | grep $pid | cut -d% -f1 | awk '{print $NF}'`
  ---
  >     SNORTPCT=`top -b -n1 | grep snort | grep $pid | cut -d% -f1 | awk '{print $NF}'`

Further investigation running the command manually revealed that TOP did not exit as it should from the count mode (-n 1), holding and updating the process as it ran forward. The following memory status at failure mirrored the one before.

  Sat Jan  1 01:03:37 UTC 2000                                                   
              total         used         free       shared      buffers        
   Mem:        30512        29868          644            0          288        
   Swap:        65524        33356        32168                                  
  Total:        96036        63224        32812                                  
  top proc: 49                                                                   
  snort procs: 252 

TOP not exiting caused over-consumption and, ultimately, a process contention problem. TOP was called within the /packetprotector/usr/sbin/snort_check script, which itself was executed every minute from the /packetprotector/usr/sbin/watchdog script as scheduled by CRON. To alleviate the problem immediately, the watchdog was modified to remove the snort_check script, as below.

  # cat /packetprotector/usr/sbin/watchdog
  #!/bin/sh                                                                      
 
  /packetprotector/usr/sbin/nas_check &                                          
  #/packetprotector/usr/sbin/snort_check &    

After seven hours of running with no traffic, the memory utilization stabilized and held constant at 32MB. This indicated a need to modify the snort_check script to improve the memory monitoring process. Since TOP was not performing as documented per the options, another method was needed to get the memory information. /proc/[pid]/status has the VmRSS information that is reported to TOP, so this was used to draw the information directly for the script.

  /proc/[pid]/status - look for VmRSS for the in memory value
  /proc/[pid]/cmdline - look for the command line executed for a grep

The following modifications were made:

  # diff snort_check snort_check.mod
  10c10,12
  < HI_PCT="130"
  ---
  > # 20120805 - Changing PCT to an absolute value for a limit. Adjust as needed. Removes the need to use TOP.
  > #HI_PCT="130"
  > HI_PCT="20000"
  16c18
  <     SNORTPCT=`top -n1 | grep snort | grep $pid | cut -d% -f1 | awk '{print $NF}'`
  ---
  >     SNORTPCT=`cat /proc/$pid/status | grep VmRSS | awk '{print $2}'`
 
/packetprotector/usr/sbin/watchdog was modified to call the new script.

  # cat usr/sbin/watchdog       
  #!/bin/sh                                                                      
                                                                               
  /packetprotector/usr/sbin/nas_check &                                          
  #/packetprotector/usr/sbin/snort_check &                                       
  /packetprotector/usr/sbin/snort_check.mod & 

The new script was run under monitoring for a while to ensure the new watchdog process was working. Consumed memory was now unaffected and the script is confirmed to be working properly. The change was made permanent to my unit and has eliminated crashing for more than a week.

2 comments:

Unknown said...

Thanks for that blog post. Great troubleshooting process. Makes me want to load PP back on my router.

Unknown said...

Very nice deconstruction and troubleshooting. Thanks for the posting.