Yesterday I posted a request for information on the use of accented characters
in French. After a few replies, I think I need to clarify my message a bit.
I have a corpus of Swiss-French news agency reports. I took a sample of
about 86 million characters from it and counted character occurrences
as follows:
  e      - 7.67M    a       - 4.44M    i - 4.06M    o - 2.96M    u - 2.86M
  e-aigu - 1.3M     a-grave - 0.22M    *ALL* other accented characters - 0.28M
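A minimal sketch of how counts like these could be produced, in Python. It assumes the text has already been decoded to Unicode (rather than an "e/"-style transliteration), and the function name accent_profile is purely illustrative. A letter is treated as accented if its Unicode decomposition contains a combining mark, which also counts c-cedilla.

```python
import unicodedata
from collections import Counter

def accent_profile(text):
    """Return (per-letter counts, number of accented letters) for text."""
    per_letter = Counter()
    accented = 0
    # NFC first, so a letter plus combining accent becomes one character.
    for ch in unicodedata.normalize("NFC", text).lower():
        if not ch.isalpha():
            continue
        per_letter[ch] += 1
        # NFD splits an accented letter into base letter + combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        if any(unicodedata.combining(c) for c in decomposed):
            accented += 1
    return per_letter, accented

per_letter, accented = accent_profile("Le président François Mitterrand")

# Share implied by the counts quoted above (in millions of characters).
reported_share = (1.3 + 0.22 + 0.28) / 86
print(f"{reported_share:.1%}")
```

On the figures quoted above, accented characters come to roughly 2.1% of the 86 million characters sampled.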
This struck me as rather a small percentage for accented characters, so I took the
word "pre/sident" (e/ equals e-aigu) and found 48,677 occurrences, while for the word
"president" I found 9,644 occurrences. Initially this struck me as evidence of "lazy"
journalism, though somebody pointed out that these are two different words. On
inspecting a couple of pages of these occurrences I do indeed find examples of
"le president des Etats-Unis" (which is correct), but I also find "le president
Francois Mitterrand" (which is incorrect). I don't want to have to count all of
these.
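The raw per-form counts can be sketched as below. This is only an illustration in Python, again assuming Unicode text; count_forms is a hypothetical helper name, and the sample sentence is made up. It tallies each spelling but cannot tell a correct unaccented use from a dropped accent, which is exactly the part that would still need manual inspection.

```python
import re
import unicodedata
from collections import Counter

def count_forms(text, forms):
    """Count case-insensitive whole-word occurrences of each form."""
    # NFC keeps an accented letter as one character so \w+ does not split it.
    normalized = unicodedata.normalize("NFC", text).lower()
    tokens = Counter(re.findall(r"\w+", normalized))
    return {form: tokens[form] for form in forms}

sample = ("Le président Mitterrand et le president des Etats-Unis. "
          "Le Président parle.")
form_counts = count_forms(sample, ["président", "president"])
```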
I have no basis for my intuition that accented characters occur less often
than expected, except that I remember a similar situation arising for a
corpus of Mexican Spanish newspaper texts, where "lazy" journalism led to the dropping
of accents, and I wondered whether the same is true here.
So my question is this: does anybody know whether the relative numbers of occurrences
of accented characters shown above are normal?
The reason I'm chasing this information is that I am evaluating an information
retrieval application based on the shapes of words and letters where the accented
characters and the letter "i" all have the same shape.
Thanks for helping
- Alan Smeaton