PPRuNe Forums - View Single Post - Proviation customer care problems

12th Aug 2013, 01:02

#133 (permalink)

abgd

Join Date: Sep 2011

Location: The Wild West (UK)

Age: 45

Posts: 1,151

Quote:

As these are anonymous forums the origins of the contributions may be opposite to what may be apparent. In fact the press may use it, or the unscrupulous, or sciolists*, to elicit certain reactions.

Those with too much time on their hands may like to download the 'R' statistics program from the R Project for Statistical Computing site (https://www.r-project.org/)

and the 'stylo' stylometric package:

https://sites.google.com/site/computationalstylistics/

You could then download the contents of the thread and create a set of files, each containing a representative sample of one contributor's posts. You might need to add a few posts from other threads in order to obtain a sufficiently large sample of each author's 'style'. You might also need to exclude authors whose contributions to PPRuNe have been minimal.

Obviously this could be a lot of work, so you could use a script such as the following Matlab or Octave file to do most of it automatically:

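function parse_file(fname)
% Split a saved copy of the thread into one text file per contributor.
% Assumes each post starts with a header line that ends in 'YYYY HH:MM',
% with the username at the start of that line.

a = fopen(fname, 'r');
b = fopen('scratch', 'w');    % collects any text found before the first header

while ~feof(a)
    myline = fgetl(a);
    if ~ischar(myline)        % fgetl returns -1 at end of file
        break;
    end
    charline = 0;
    if length(myline) > 20
        datel = myline((end-9):end);    % last 10 characters, e.g. '2013 01:02'
        if any(strcmp(datel(1:4), {'2012', '2013'})) && datel(8) == ':'
            % Header line: drop the last 19 characters (date and time) and
            % all whitespace, leaving the username.
            charl = sscanf(myline(1:(end-19)), '%s');
            % stylo takes the author label from the part of the file name
            % before the first underscore, so strip '_' (and '-') from names.
            charl = charl(charl ~= '_');
            charl = charl(charl ~= '-');
            fclose(b);
            b = fopen([charl, '_text.txt'], 'a');
            charline = 1;
        end
    end
    if charline == 0
        fprintf(b, '%s\n', myline);     % ordinary line: append to current author's file
    end
end

fclose(a);
fclose(b);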

You'd have to do a small amount of quote removal manually so that each author's file only contained his own output.
You would then move these files to a directory named 'corpus' and run stylo, which produces dendrograms such as this:

where works are arranged by similarity; those that were written by the same author tend to be grouped together.
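
For anyone who would rather not click through the dialog, the stylo step can probably be run from the R prompt along these lines. This is only a sketch: it assumes the package is installed (it can now be installed from CRAN as well as from the site above), that R's working directory is the folder containing 'corpus', and that the files inside follow the author_text.txt naming the script produces.

```r
# Assumption: the working directory contains a 'corpus' folder holding
# one <author>_text.txt file per contributor.
# install.packages("stylo")   # one-off, if the package is not yet installed
library(stylo)

# gui = FALSE skips the settings dialog and uses the defaults, which
# should give a cluster analysis drawn as a dendrogram.
# corpus.dir = "corpus" is already the default; it is spelled out for clarity.
results <- stylo(gui = FALSE, corpus.dir = "corpus")
```

Calling stylo() with no arguments instead opens the settings dialog, which is the easier route if you later want to vary anything.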

Now, there are upwards of 50 contributors to this thread, so you would expect a number of people to have writing styles that resemble each other simply by chance. On the other hand, if you were to predict in advance that two or more authors were actually the same person, then run the analysis and find that their posts were stylometrically similar, you could feel considerable justification for your viewpoint.
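
To put a rough number on the 'by chance' problem: with 50 contributors the number of possible pairs is already large. The per-pair probability p below is purely illustrative, not something measured from the thread.

```latex
\binom{50}{2} = \frac{50 \times 49}{2} = 1225 \text{ pairs}
\qquad
\text{expected chance matches} \approx 1225\,p,
\quad \text{e.g. } p = 0.01 \;\Rightarrow\; 1225 \times 0.01 \approx 12
```

So even if only one pair in a hundred looked alike by accident, a dozen or so spurious 'matches' would be unsurprising, which is why a prediction made before running the analysis counts for far more than a resemblance spotted afterwards.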

Should a person choose to do this, I would suggest using the default stylo settings, as they're likely to give the most valid results. You could then vary some of the settings to see whether your conclusions were robust, but it would be very bad practice to keep varying the settings until you got the answer you expected or wanted.

Last edited by abgd; 12th Aug 2013 at 01:13.
