An approach to translation with possible uses in term-extraction and specialized dictionaries

This is a draft of my current work on how to approach a text with a high proportion of obscure terms.

My first goal was to arrange a text according to different criteria, structural and grammatical, starting from a single document. For this purpose I use a .txt file, which is parsed and saved again for further text processing. The original source is a manuscript, Peniarth 4 (Llyfr Gwyn Rhydderch) (p1r c1 l1 - p10r c38 l11); an on-line transcription by Thomas, Peter Wynn, D. Mark Smith and Diana Luft (2007) is available at the site Welsh Prose 1350–1425. Anyone interested in a detailed transcription of the first branch of the Mabinogi should visit these pages.

The main goal is the linguistic study of the above-mentioned excerpt. To allow a better understanding of the text I wrote several scripts to run over a tagged file. A new file was created from the version available on-line at http://cy.wikisource.org/wiki/Pwyll_Pendeuic_Dyuet , a text under the Creative Commons Attribution/Share-Alike License. Only raw text has been used from the file at Wikisource, with no structural marks or tags. This was the starting point.

From here the text is tagged with a minimum number of tags to indicate page and column, to be finally arranged as a whole as follows:

The text begins to appear meaningful and structured. From here, data is entered into a database with new word and sentence processing scripts. This is a basic search engine to look for any unit, and the first panel for user-text interaction. (Very basic search capabilities!)

And finally, the main goal of all this work, and the first result that allows some useful data to be inferred: an ordered list of sentences to translate.
What is this? Well, I start from the assumption that the main difficulty in translating or understanding a text, provided that the grammar of the language is already known, is the number of unknown words.

Sentences here are arranged in terms of their value as context. Each sentence is valued as a whole in terms of the frequency of its elements, or of its already processed elements. Once processed, that is, once there is a definition in the lexicon, words are given a maximum value as providers of context. While they remain unprocessed, words with the highest frequency have a higher value, as they have more context from which to infer a meaning, and they are also more productive in bringing context to a greater number of sentences. For now, the value of each sentence is calculated as the mean of the frequencies of all its elements.
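
The valuation described above can be sketched roughly as follows. This is only an illustration, not the scripts actually used for this page: the function name, the toy sentences, and the choice of the highest corpus frequency as the "maximum value" for processed words are all assumptions.

```python
from collections import Counter

def sentence_values(sentences, lexicon=frozenset()):
    """Score each tokenised sentence by the mean value of its words.

    A word already in the lexicon counts with the maximum value
    (assumed here to be the highest frequency in the corpus);
    any other word counts with its own corpus frequency.
    """
    freq = Counter(word for sentence in sentences for word in sentence)
    max_value = max(freq.values(), default=0)
    scores = []
    for sentence in sentences:
        if not sentence:
            scores.append(0.0)  # an empty sentence provides no context
            continue
        values = [max_value if w in lexicon else freq[w] for w in sentence]
        scores.append(sum(values) / len(values))
    return scores

# Made-up tokens standing in for the units of three sentences.
sentences = [["a", "b", "a"], ["c", "a"], ["d"]]
scores = sentence_values(sentences)
scores_with_lexicon = sentence_values(sentences, lexicon={"d"})
```

With an empty lexicon the first sentence gets the highest value because it repeats the most frequent unit; once "d" is entered in the lexicon, the third sentence rises to the maximum.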

The next step is more dynamic. Sentences have to be reordered each time a new sentence (or even a word unit) is solved. Solving here is understood as translating or inferring the meaning of the sentence and its elements.
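
A minimal sketch of such a reordering loop is given below. It assumes that solving a sentence solves all of its units at once, that a solved unit takes the highest corpus frequency as its value, and that a sentence is valued by the mean of its unit values; the function name and the toy data are hypothetical.

```python
from collections import Counter

def translation_order(sentences):
    """Return sentence indices in the order they would be picked,
    re-valuing the remaining sentences after each one is solved."""
    freq = Counter(w for s in sentences for w in s)
    top = max(freq.values(), default=0)  # assumed value of a solved unit
    solved = set()
    remaining = list(range(len(sentences)))
    order = []

    def value(i):
        words = sentences[i]
        if not words:
            return 0.0
        return sum(top if w in solved else freq[w] for w in words) / len(words)

    while remaining:
        best = max(remaining, key=value)  # ties go to the earlier sentence
        order.append(best)
        remaining.remove(best)
        solved.update(sentences[best])    # its units now provide context
    return order

# Made-up tokens standing in for the units of four sentences.
order = translation_order([["a", "b"], ["a", "a", "c"], ["b", "d"], ["d"]])
```

In this toy case a static ranking would place the second sentence right after the first, but once "a" and "b" are solved the third sentence overtakes it.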




Friday, 16 December 2011, 15:00 UTC
The ordered list of sentences is displayed according to the total value of units, not in relative terms as previously intended. The mean value used is saved as a FLOAT in MySQL, but when using 'ORDER BY sentencevalue DESC' the sentences are not ordered. I'm using Excel to work with the data until I fix this issue in MySQL.

21:00 UTC Fixed after some hours of trying different approaches, and a long break of another two hours too! Solved with a PHP script that sets descending order by the last value shown at the end of each sentence. Next week I should be able to write some paragraphs to explain what all those numbers are about and what I want to do! :-)

Thursday, 5 January 2012, 10:00 UTC
Added a short explanatory paragraph on the index of sentences.

Friday, 6 January 2012, 13:15 UTC
I noticed yesterday that in pwylltaggedtei.txt, the main file used to arrange the text, the html tags for pages showed missing and incorrect values, both in the file uploaded to the server and in my local mirror. I have checked all the attributes for pages and columns and uploaded the file again. The text is properly tagged now.

Wednesday, 1 February 2012, 18:00 UTC
A new file is used to run the scripts.

Tuesday, 14 August 2012, 7:00 UTC
Style: the word 'data' appeared with a plural demonstrative in an explanatory paragraph of this index. Deleted as it was redundant in this context, regardless of number.

Wednesday, 19 September 2012, 14:35 UTC
Index of unique terms given with number of occurrences and percentage over total text only. Partial percentage over number of unique terms is no longer shown.
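
As an illustration of what this index contains, here is a small sketch that counts each unique term and gives its share of the total text. The function name and the sample tokens are made up for the example.

```python
from collections import Counter

def term_index(tokens):
    """List (term, occurrences, percentage of total text), most frequent first."""
    counts = Counter(tokens)
    total = len(tokens)
    return [(term, n, 100.0 * n / total) for term, n in counts.most_common()]

# Made-up tokens, not taken from the text.
index = term_index("a b a c a b".split())
```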

Tuesday, 16 October 2012, 11:30 UTC
Scripts reviewed to add more compatibility with UTF-8 encoded texts. All occurrences of terms are shown on the OCCURRENCES page; each number corresponds to the position of the term within the text.
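
A sketch of how such a list of positions can be built is shown below. Counting positions from 1 is an assumption, and the function name and sample tokens are hypothetical.

```python
from collections import defaultdict

def occurrence_positions(tokens):
    """Map each term to the positions (counted from 1) where it occurs."""
    positions = defaultdict(list)
    for position, term in enumerate(tokens, start=1):
        positions[term].append(position)
    return dict(positions)

# Made-up token sequence; Python 3 strings are Unicode, so accented
# letters need no special handling once the file is read as UTF-8.
positions = occurrence_positions(["tŷ", "a", "gŵr", "a", "tŷ"])
```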

Wednesday, 31 October 2012, 17:35 UTC
Sentences ordered according to different criteria, following the assumption that the "highest number of units per sentence reveals more grammatical complexity". Main features of the sentence ordering model: 1) more relevance of frequency (the fact that words appear more than once is considered more relevant), evaluated as a product of frequencies; 2) highest relevance of grammatical complexity = number of units per individual sentence, evaluated to the power of the mean number of units per sentence in the corpus.
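
One possible reading of this model is sketched below. How the two features are combined is not stated here, so multiplying them is an assumption, as are the function name and the toy data. The score is kept as a logarithm, which preserves the ordering while avoiding very large numbers.

```python
import math
from collections import Counter

def complexity_scores(sentences):
    """Log of (product of unit frequencies) * (units ** mean units per sentence)."""
    if not sentences:
        return []
    freq = Counter(w for s in sentences for w in s)
    mean_len = sum(len(s) for s in sentences) / len(sentences)
    scores = []
    for s in sentences:
        if not s:
            scores.append(float("-inf"))  # empty sentences rank last
            continue
        log_freq_product = sum(math.log(freq[w]) for w in s)
        log_complexity = mean_len * math.log(len(s))
        scores.append(log_freq_product + log_complexity)
    return scores

# Made-up tokens standing in for the units of three sentences.
sentences = [["a", "b", "a"], ["c", "a"], ["d"]]
scores = complexity_scores(sentences)
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```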

Wednesday, 31 October 2012, 18:00 UTC
Sentence number 394 seems to be missing a punctuation mark. A revision of text and manuscript is needed.

Friday, 2 November 2012, 13:30 UTC
Question mark now used as a sentence delimiter, as well as full stops and colons. The whole text has been processed again and the sentences database updated. The number of sentences increased from 463 to 498. The previous sentence reference index becomes obsolete.
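
The effect of adding a delimiter can be illustrated with a small sketch. It assumes that a sentence ends at a delimiter followed by whitespace; the function name and the sample text are made up.

```python
import re

def split_sentences(text, delimiters=".:?"):
    """Split text after any delimiter character that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(delimiters) + r"])\s+"
    return [part for part in re.split(pattern, text.strip()) if part]

sample = "One two. Three: four? Five"
before = split_sentences(sample, delimiters=".:")  # question mark ignored
after = split_sentences(sample)                    # question mark included
```

Here the count goes from three sentences to four once the question mark is treated as a delimiter, which is the same kind of change that makes earlier sentence numbers obsolete.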

Friday, 25 January 2013, 13:30 UTC
Headings at top and over select menu added to each ordered sentences list.

Friday, 1 February 2013, 15:30 UTC
Headings of ordered sentences changed to better describe the lists after an overall examination of the results. Deviation from expected results found in sentences with the minimum number of units for the lists with frequency as the most relevant parameter. Overall ordered lists fit the expected results. The model used to produce each ordered list should be tested against different alternative texts to better evaluate its performance.

Wednesday, 19 February 2014, 11:00 UTC
A copy of the Pwyll database and scripts has been installed on a new server. Thanks to the IT department at Mongolia International University.