Thursday 13 July 2017

Script to remove bad characters from a set of files

The need for this script was prompted by a series of files being uploaded to SharePoint which had special characters within their filenames, such as an asterisk or tilde.

Although there are many ways to achieve this, I chose a simple approach using cp and sed.

We can use sed's substitute command to replace any bad characters. Suppose we have the following directory that we wish to 'cleanse':

ls -la /tmp/test

drwxrwxr-x.  2 limited limited  120 Jul 13 13:45 .
drwxrwxrwt. 40 root    root    1280 Jul 13 13:43 ..
-rw-rw-r--.  1 limited limited    0 Jul 13 13:45 'fran^k.txt'
-rw-rw-r--.  1 limited limited    0 Jul 13 13:44 @note.txt
-rw-rw-r--.  1 limited limited    0 Jul 13 13:44 'rubbi'\''sh.txt'
-rw-rw-r--.  1 limited limited    0 Jul 13 13:43 'test`.txt'

We can run a quick test to see what the results would look like just piping the result out to stdout:

#!/bin/bash
cd /tmp/test
FileList=*
for file in $FileList;
    # quote the file name and the sed expression so the shell leaves them alone
    do (printf '%s\n' "$file" | sed "s/['\`^@]/_/g" );
done;

Note: The 'g' option instructs sed to substitute all matches on each line.
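
To check the expression without touching any files, the same substitution can be wrapped in a small function and fed a few sample names. This is only a minimal sketch: the function name replace_listed is hypothetical and not part of the script above, and the sample names are the ones from the directory listing.

```shell
#!/bin/bash
# Hypothetical helper (not from the original script): replace the
# characters ' ` ^ and @ with an underscore.
replace_listed() {
    printf '%s\n' "$1" | sed "s/['\`^@]/_/g"
}

# Sample names taken from the directory listing
for name in 'fran^k.txt' '@note.txt' "rubbi'sh.txt" 'test`.txt'; do
    replace_listed "$name"
done
```

Each name is printed with the offending character swapped for an underscore, for example fran^k.txt becomes fran_k.txt.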

Or an even better approach (adapted from here):

#!/bin/bash
cd /tmp/test
FileList=*
for file in $FileList;
    # quoting stops the shell treating the bracket expression as a glob
    do (printf '%s\n' "$file" | sed 's/[^a-zA-Z0-9._-]/_/g' );
done;

Outside square brackets, a caret (^) anchors the match to the beginning of the line. As the first character inside a bracket expression ([ ]), however, it inverts the set - so any character that is not a letter, digit, dot, underscore or hyphen is replaced with the underscore character.
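
The three roles of the caret are easiest to see side by side. The following is a rough sketch run against a made-up sample string; the variable names are illustrative only.

```shell
#!/bin/bash
sample='a^b c'

# Outside brackets: ^ anchors to the start of the line,
# so only the leading 'a' is replaced.
anchored=$(printf '%s\n' "$sample" | sed 's/^a/_/')

# First inside brackets: ^ negates the set, so every character that is
# NOT a letter, digit, dot, underscore or hyphen is replaced.
negated=$(printf '%s\n' "$sample" | sed 's/[^a-zA-Z0-9._-]/_/g')

# Anywhere else inside brackets: ^ is just a literal caret.
literal=$(printf '%s\n' "$sample" | sed 's/[@^]/_/g')

printf '%s\n' "$anchored" "$negated" "$literal"
```

Only the second form replaces the space as well as the caret, which is why it is the one used for sanitising.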

If we are happy with the results we can get cp to copy the files into our 'sanitised directory':

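#!/bin/bash
cd /tmp/test
FileList=*
OutputDirectory=/tmp/output/   # must already exist
for file in $FileList;
    # printf '%s' keeps a '%' in a file name from being read as a format string
    do cp -- "$file" "$OutputDirectory$(printf '%s' "$file" | sed 's/[^a-zA-Z0-9._-]/_/g')";
done;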
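
The copy step can be tried safely against throwaway directories before pointing it at real data. This is a sketch only, assuming mktemp -d is available; the variable names (src, out, name, clean) and the sample file 'my file.txt' are illustrative and not part of the original script.

```shell
#!/bin/bash
# Work in throwaway directories so nothing real is touched
src=$(mktemp -d)
out=$(mktemp -d)
touch "$src/fran^k.txt" "$src/@note.txt" "$src/my file.txt"

for file in "$src"/*; do
    name=${file##*/}   # strip the directory part
    clean=$(printf '%s' "$name" | sed 's/[^a-zA-Z0-9._-]/_/g')
    cp -- "$file" "$out/$clean"
done

ls "$out"
```

Because every expansion is quoted, the file name containing a space is copied as a single file rather than being split into two arguments.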

There are some limitations to this however - for example the above script does not handle sub directories, because the * glob only matches entries in the top-level directory - so in order to cater for this we need to make a few changes:

#!/bin/bash

if [ $# -ne 2 ]
  then
    echo "Usage: stripbadchars.sh <source-directory> <output-directory>"
    exit 1
fi

OutputDirectory=$2
# List regular files only; reading line by line keeps names containing spaces intact
find "$1" -type f | while IFS= read -r file
    do BASENAME=$(basename "$file")
    BASEPATH=$(dirname "$file")
    # only the file name is sanitised - directory names in the path are left as they are
    SANITISEDFNAME=$(printf '%s' "$BASENAME" | sed 's/[^a-zA-Z0-9._-]/_/g')
    # cp won't create the directory structure for us - so we need to do it ourselves
    mkdir -p "$OutputDirectory/$BASEPATH"
    echo "Writing file: $OutputDirectory/$BASEPATH/$SANITISEDFNAME"
    cp "$file" "$OutputDirectory/$BASEPATH/$SANITISEDFNAME"
done

Note: A simple glob such as * only matches entries in the current directory and does not list files recursively - so instead we can use the 'find' command to do this for us.
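
The difference between a glob and find can be seen on a small throwaway tree. This is a sketch under a few assumptions: mktemp -d is available, the directory and file names are invented for the example, and no file name contains a newline (reading find's output line by line would split such a name).

```shell
#!/bin/bash
# Illustrative tree in a throwaway directory
src=$(mktemp -d)
mkdir -p "$src/sub dir"
touch "$src/top.txt" "$src/sub dir/nes^ted.txt" "$src/sub dir/other.txt"

# A glob only expands to the top-level entries: top.txt and 'sub dir'
set -- "$src"/*
glob_count=$#
echo "glob found $glob_count entries"

# find descends into sub directories; reading line by line keeps
# names containing spaces in one piece
find_count=0
while IFS= read -r file; do
    echo "find found: ${file#"$src"/}"
    find_count=$((find_count + 1))
done <<EOF
$(find "$src" -type f)
EOF
echo "find found $find_count files"
```

The glob reports two entries, while find reports all three files, including the two inside 'sub dir'.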

Save the script and make it executable:

vi stripbadchars.sh
chmod 700 stripbadchars.sh

and execute with:

./stripbadchars.sh /tmp/test /tmp/output



