Thursday 13 July 2017

Script to remove bad characters from a set of files

The need for this script was prompted by a series of files being uploaded to Sharepoint which had special characters within their filenames such as an astrix or tilde.

Although there are many ways to achieve this I chose for a simplistic approach using cp and sed.

We can use the sed substitute function to replace any bad characters - we have the following directory we wish to 'cleanse':

ls /tmp/test

drwxrwxr-x.  2 limited limited  120 Jul 13 13:45 .
drwxrwxrwt. 40 root    root    1280 Jul 13 13:43 ..
-rw-rw-r--.  1 limited limited    0 Jul 13 13:45 'fran^k.txt'
-rw-rw-r--.  1 limited limited    0 Jul 13 13:44 @note.txt
-rw-rw-r--.  1 limited limited    0 Jul 13 13:44 'rubbi'\''sh.txt'
-rw-rw-r--.  1 limited limited    0 Jul 13 13:43 'test`.txt'

We can run a quick test to see what the results would look like just piping the result out to stdout:

cd /tmp/test
for file in $FileList; 
    do (echo $file | sed s/[\'\`^@]/_/g ); 

Note: The 'g' option instructs sed to substitute all matches on each line.

Or an even better approach (adapted from here):

cd /tmp/test
for file in $FileList; 
    do (echo $file | sed s/[^a-zA-Z0-9._-]/_/g ); 

The addition of the caret (^) usually means match at the beginning of the line in a normal regex - however in the context where the brace ([ ]) operators are used in inverse the operation - so anything that does not match the specified is replaced with the underscore character.   

If we are happy with the results we can get cp to copy the files into our 'sanitised directory':

cd /tmp/test
for file in $FileList; 
    do cp $file $OutputDirectory$(printf $file | sed s/[^a-zA-Z0-9._-]/_/g); 

There are some limitations to this however - for example the above script will not work with sub directories properly - so in order to cater for this we need to make a few changes:


if [ $# -eq 0 ]
    echo "Usage: <source-directory> <output-directory>"

FileList=`find $1 | tail -n +2` # we need to exclude the first line (as it's a directory path)
for file in $FileList
    do BASENAME=$(basename $file)
    BASEPATH=$(dirname $file)
    SANITISEDFNAME=`echo $BASENAME | sed s/[^a-zA-Z0-9._-]/_/g`
    # cp won't create the directory structure for us - so we need to do it ourself
    mkdir -p $OutputDirectory/$BASEPATH
    echo "Writing file: $OutputDirectory$BASEPATH/$SANITISEDFNAME"
    cp -R $file $OutputDirectory$BASEPATH/$SANITISEDFNAME

Note: Simple bash variables will not list all files recursively - so instead we can use the 'find' command to do this for us.

chmod 700

and execute with:

./ /tmp/test /tmp/output


Post a Comment