Issue 549: revise TX5 Reading versus TX6 Transcription

ID:

549

Starting Date:

2021-09-29

Working Group:

Status:

Done

Background:

Posted by Martin on 6/9/2021

Dear All,

I believe that TX5 Reading and TX6 Transcription should be in a different relationship.

In more detail, I propose to rename TX5 Reading to "TX5 Text Recognition", and to ontologically strictly separate observation from inferred interpretation of meaning, since TX5 Reading is declared as a subclass of Observation, and TX6 Transcription is not.

Note that one can perfectly "read" a clear text written in a known script, without understanding any word. E.g., I can indeed copy well-written or printed Chinese Han characters without understanding any Chinese, just by knowledge of the relevant structural features. I assume the same holds for cuneiform. Equally, I can copy a Latin inscription without understanding any of the abundant abbreviations. This is indeed the proper observation.

Whether the result of this "reading" is documented in the same script and notation or not is a detail left to the reader. I'd argue, however, that the class TX5 needs a formal output, an instance of E90 Symbolic Object at least, in order to be useful. This is missing in the current model. Transcription in the sense of changing script or notation could be an internal, undocumented intermediate step of the text recognition ("transcribing text recognition", or adequate output properties), or an explicit step after the recognition of the Symbolic Object.

It is obviously true that text recognition typically includes arguments of understanding. I'd argue, that this is not intrinsic to reading, but only applies to texts not clearly typed. Strictly speaking, any such process constitutes ERROR CORRECTION and text COMPLETION.

Therefore, I propose a new class "Meaning Comprehension", which would take as input a recognized text and interpret an assumed meaning in plain language, or even formal propositions, which would be the final stage of the reading process, resulting in an information object. This class may reside in CRMinf or in CRMtex.

We can then construct from "Text Recognition", "Transcription" and "Meaning Comprehension" combined and short-cutting constructs, which would include "error correction", "resolution of recognition ambiguity" and "missing part completion" as useful in practice for representing typical scholarly defaults.

I'd argue that resolution of linguistic ambiguity using scholarly arguments about the likely context of reference of the text constitutes a scholarly interpretation process after "reading", regardless of whether error correction and completion used such arguments.

We need these separations, in order to create a clear interface to "Belief Adoption" in CRMinf, which is about the assumed real world truth of statements in texts.

Opinions?

Posted by Achille & Francesca on 12/09/2021

Dear Martin,

Your observations are extremely stimulating, as usual, especially with regard to the observation of a linguistic object which is a very complex and articulated operation. In our view, in particular, the following conditions can occur while a linguistic object is observed:

1. I observe some signs on a surface without even realising that it is a text.
2. I understand that it is a text but without understanding its meaning (typically, even without understanding what language or writing system it is).
3. I actually read and understand text.

We agree with your observation that the relationship between these types of observation needs to be better specified. We therefore propose to keep TX5 Reading as the most specific observation (type 3) and to define one or more classes for the other observation cases of which TX5 could become a specific one.

As regards the Transcription, for the epigraphists this type of operation has a rather broad meaning that covers various cases, from the “exact” reproduction of the signs, to their stylised rendering, up to the transliteration using a different script. In the first cases, transcription does not necessarily imply an understanding of the signs (e.g. see publications on texts in unknown alphabets such as the one on the Phaistos Disc).

The other ideas and new classes you propose, relating to the other cases, are very intriguing, we are thinking about them, but they probably need a more articulated discussion.

Posted by Martin on 16/9/2021

Dear All,

I would describe things a bit differently.

I believe we should consider the following:
Very generally, in order to observe a sign as a sign, we need a context fitting to some hypotheses what people would put signs on, characteristics suggesting a feature as human made, but most importantly a finite selection of potential patterns. The latter condition is absolutely necessary. The result will be a ranked list of best fitting patterns.

This means that the observer must have previous knowledge of possible sign sets. This equals to partial knowledge of the writing system. There

In very rare cases a single context itself will provide a large enough set of distinct, non-random patterns in order to infer the sign set used. In these cases, typically the arrangement of signs will support hypotheses that the signs are not just decorative but form messages (regular distances, directions, alignments). This process can again be regarded as creating partial knowledge of the writing system prior to observing (recognizing) a single sign.

A variant of the latter occurs when reading a sloppy hand-written text, e.g., when I read my grandpa's notebook. The character set can only be either Latin or Sütterlin. I can distinguish the latter by recognizing a few very characteristic signs in the overall text. Then, I need to identify more and more words in the text until I am trained to more and more of Grandpa's own variants of the Sütterlin character set. This depends on the size of the sample. Only then I can go back and recognize within the continuous lines single signs.

In short, I maintain that recognizing a text can be prior to recognizing single signs.

I believe the definition of the "writing system" needs to differentiate the character set sufficiently from the other constituents of a writing system.

The arrangement of characters into texts are also standardized by a writing system. The arrangement rules of a writing system for signs in a text create also characteristic, recognizable patterns. There are enough archaeological cases, I assume, in which a text can be recognized as such without any readable character on it.

Hence, there are in addition observable arrangement features of a writing system.

Note that all results of observation in an encoded propositional form require that the observation itself applies hypotheses about the forms the feature to be observed can appear in.

Comments?

Best,

Martin

Post by Achille (1 October 2021)

Dear Martin,

Please find our comments inline.

Very generally, in order to observe a sign as a sign, we need a context fitting to some hypotheses what people would put signs on, characteristics suggesting a feature as human made, but most importantly a finite selection of potential patterns. The latter condition is absolutely necessary. The result will be a ranked list of best fitting patterns.

We agree that some characteristics of some signs (e.g. their shape, the physical properties of how they are traced, etc.) can contribute to the process of interpreting them as intentionally produced human products.
In principle, in fact, it is not the specific characteristics of the signs but their properties of occurrence on a surface that discriminate their interpretation as part of a (writing) system or not: for example, the recognition of Linear B or the interpretation of the signs of the Phaistos disc as "scripture" depend on the reciprocal relationships that the elements have among themselves, which therefore lead to the hypothesis of the existence of a system. It is a matter of fact that, within a given system, "tout se tient" as Saussure and Chomsky underline.

In short, I maintain that recognizing a text can be prior to recognizing single signs.

We fully agree also on this point.

I believe the definition of the "writing system" needs to differentiate the character set sufficiently from the other constituents of a writing system.

Hence, there are in addition observable arrangement features of a writing system.

Note that all results of observation in an encoded propositional form require that the observation itself applies hypotheses about the forms the feature to be observed can appear in.

We don't entirely agree on the existence of a set of predefined possibilities; rather, we would say that there is a number of concomitant phenomena that makes interpretation of the world possible, in this case, the activity of "writing". It is possible to create infinite writing systems with infinite varieties in the signs that compose it, because, indeed, what is important is that the internal coherence of the system is preserved.

For this reason it is not possible to create predefined lists of rules valid for all writing systems, except for giving a more general definition of a system as a functional unit consisting of several parts (in this case signs) in functional relationship with each other. But it is not possible to list a priori which and how many functional relationships a given system will have. Roy Harris's "La sémiologie de l'écriture" is very enlightening in this regard.

If by "arrangement of characters into texts" you mean things like, for instance, the "ductus", i.e., the direction of the writing of the text (from right to left, from left to right, etc.), this is actually not a "rule" of the writing system as well, but just the application of a possible model to a text. It is true that a writing system can have a preferred ductus and that this can eventually lead to the standardisation of a certain use, but this does not concern the system itself: if I write in English from left to right (or in a specular form, as Leonardo da Vinci used to do for Italian), the internal rules of the system (differentiation of signs, combination of letters etc.) remain unchanged. The reader is simply less used to a different use. In fact, some writing systems do not have a preferential ductus (see for instance Archaic Greek). Thus, a certain set of observable disposition characteristics occur IN the text, not in the writing system. This is a subtle but necessary semiological difference.

As for the Sütterlin, again this phenomenon does not refer to the system but just to the shape of the signs, the way they are linked together etc., and therefore it concerns their "style", not the internal differentiability of the signs themselves. Within the Latin script ecosystem, the Sütterlin has exactly the same stylistic significance as the Caroline minuscule.

In summary: the possible styles that can be manifested in writing are infinite, as long as the signs maintain a degree of internal recognition (i.e., the "differential" value remains unchanged) and remain recognisable by the reader. See the letter "T" example in the attached image from Saussure's "Course" (CLG, section 165) for more details.

We hope this helps :-)

Regards,
Francesca & Achille

Post by Martin on 1/10/2021

Dear Achille, we widely agree, but, I simply meant more general, more fundamental principles.

On 10/1/2021 6:47 PM, Achille Felicetti wrote:

Dear Martin,

Please find our comments inline.

Very generally, in order to observe a sign as a sign, we need a context fitting to some hypotheses what people would put signs on, characteristics suggesting a feature as human made, but most importantly a finite selection of potential patterns. The latter condition is absolutely necessary. The result will be a ranked list of best fitting patterns.

We agree that some characteristics of some signs (e.g. their shape, the physical properties of how they are traced, etc.) can contribute to the process of interpreting them as intentionally produced human products.
In principle, in fact, it is not the specific characteristics of the signs but their properties of occurrence on a surface that discriminate their interpretation as part of a (writing) system or not: for example, the recognition of Linear B or the interpretation of the signs of the Phaistos disc as “scripture” depend on the reciprocal relationships that the elements have among themselves, which therefore lead to the hypothesis of the existence of a system. It is a matter of fact that, within a given system, "tout se tient" as Saussure and Chomsky underline.

Yes, of course, this is exactly what I meant below: "The arrangement of characters into texts are also standardized by a writing system".

Well, yes, I meant the above very generally. You may find mason marks somewhere on a block of stone, a single sign possibly, at no particular position, but you find on blocks of the same context, or a comparable context, similar isolated, deliberately applied scratches, people have learnt to recognize as being functional.

In short, I maintain that recognizing a text can be prior to recognizing single signs.

We fully agree also on this point.

I believe the definition of the "writing system" needs to differentiate the character set sufficiently from the other constituents of a writing system.

The arrangement of characters into texts are also standardized by a writing system. The arrangement rules of a writing system for signs in a text create also characteristic, recognizable patterns. There are enough archaeological cases, I assume, in which a text can be recognized as such without any readable character on it.

Hence, there are in addition observable arrangement features of a writing system.

Note that all results of observation in an encoded propositional form require that the observation itself applies hypotheses about the forms the feature to be observed can appear in.

We don't entirely agree on the existence of a set of predefined possibilities; rather, we would say that there is a number of concomitant phenomena that makes interpretation of the world possible, in this case, the activity of “writing". It is possible to create infinite writing systems with infinite varieties in the signs that compose it, because, indeed, what is important is that the internal coherence of the system is preserved.

Of course, this is not what I said

We must VERY carefully distinguish what we infer as rules from multiple observations. If we observe a boustrophedon writing, there is still a linearity in the characters and EVEN an indication of direction. It is "character - space - character - space...", the spaces tend to be similar, and lines are horizontal. You will not write 1 2 3 4 5 6 in the order 2 1 4 3 6 5. Why not? Could be cryptography. How many characters would you need in a linear sequence to understand that they have to be read in "2 1 4 3..." order? And yet, the arrangement cannot be random, but must follow a rule. The old art of cryptography tells us a lot about variants that make a text unrecognizable. If the character order is random but only known to one sender and receiver, we already have a big problem deciphering!

If by "arrangement of characters into texts" you mean things like, for instance, the “ductus”, i.e., the direction of the writing of the text (from right to left, from left to right, etc.), this is actually not a "rule" of the writing system as well, but just the application of a possible model to a text. It is true that a writing system can have a preferred ductus and that this can eventually lead to the standardisation of a certain use, but this does not concern the system itself: if I write in English from left to right (or in a specular form, as Leonardo da Vinci used to do for Italian), the internal rules of the system (differentiation of signs, combination of letters etc.) remain unchanged. The reader is simply less used to a different use.

I don't mean ductus, or any specific kind of arrangement. I mean whatever the necessary arrangement rules are.
I think you miss the concept of sequence here, which is the basis of all arrangements. Language is linear - sequential, and writing will have a concept of corresponding linearity and regularity of distances between signs. It's a bit like rhythm in music.

In fact, some writing systems do not have a preferential ductus (see for instance Archaic Greek). Thus, a certain set of observable disposition characteristics occur IN the text, not in the writing system. This is a subtle but necessary semiological difference.

Sure, simply, the arrangement rules are of different nature. The concept of sequence cannot be abandoned. Otherwise, you have a pot of letters, and the reader should put them in an order to create the intended message. That would, obviously, not work for longer and not pre-agreed messages.
Whatever the nature of the necessary rules are, either they are a priori known, or can only be inferred from longer examples, and yet, inference will not work without a sequence hypothesis.

See also Korean script. It is not as linear as ours, but still a regular path, a bit convoluted.

A variant of the latter occurs when reading a sloppy hand-written text, e.g., when I read my grandpa's notebook. The character set can only be either Latin or Sütterlin. I can distinguish the latter by recognizing a few very characteristic signs in the overall text. Then, I need to identify more and more words in the text until I am trained to more and more of Grandpa's own variants of the Sütterlin character set. This depends on the size of the sample. Only then I can go back and recognize within the continuous lines single signs.

As for the Sütterlin, again this phenomenon does not refer to the system but just to the shape of the signs, the way they are linked together etc., and therefore it concerns their "style", not the internal differentiability of the signs themselves. Within the Latin script ecosystem, the Sütterlin has exactly the same stylistic significance as the Caroline minuscule.

This is not correct. I can show you my Grandpa's notebook. It is real. Internal differentiability of the signs themselves is definitely not always given. I do not know a priori if he writes Latin or Sütterlin. He did both. The first thing I do is check until I recognize a distinct Sütterlin sign. But when he wrote Latin, he may throw in a Sütterlin sign sometimes. Sloppy ligatures and shorthand do not always allow the signs to be differentiated. I need many examples until I am familiar with recognizing small words as a whole, characteristic for him individually, because the signs are fused.

In summary: the possible styles that can be manifested in writing are infinite, as long as the signs maintain a degree of internal recognition (i.e., the “differential" value remains unchanged) and remain recognisable by the reader. See the letter “T” example in the attached image from Saussure’s “Course” (CLG, section 165) for more details.

I actually do not agree. Infinite is a big word. Mathematically, with infinite arrangements and styles, you can interpret any text as being any other text.
Simply, we must be clearer about what the real features we rely on are. I kindly invite you to do a reading exercise in my Grandpa's notebook.
Styles need to maintain a similarity with the standard shape. There are constraints on which deviations the human brain can tolerate and still recognize the sign. Currently, newer neural networks do a good job identifying the rules behind the human anticipation of symbol similarity.

I think the text below is a bit too simplistic. It refers to characteristics that are not relevant, but nevertheless to a "within certain limits". It does not distinguish the consistency between characters in the same text, in texts of the same writer, and between texts of different writers. It does not say what the "certain limits" consist of. But this is what we need to understand. Isn't it?

All the best,

Martin

We hope this helps :-)

Regards,

Francesca & Achille

Post by Martin on 1/10/2021

Dear Francesca, Achille,

I'd like to add: As far as I have read recently, there exist very few occasions in the world of independent development of a writing system. For the things we study currently, to find another such center is very unlikely (nevertheless possible), and reading would not be possible immediately. All other writing systems are either adaptations and modifications of systems from cultures in contact, or reinventions under the impression of writing systems that have come to their knowledge. This reduces actually the variants of principles we expect in practice, beyond what is theoretically possible. Isn't it?

By the way, it would be interesting to know which kinds of inferences would allow for determining the reading direction of the Phaistos Disc.

All the best,

Martin

In the 51st CIDOC CRM & 44th FRBRoo SIG meeting, it was decided that the scope notes of Tx5 Reading, Tx6 Transcription, Tx8 Grapheme and TxP11 transcribed (was transcribed by) need to be redrafted in order for them to:

clearly distinguish btw atomic units of writing vs combinations thereof, by means of introducing a concept of Txx Grapheme Occurrence Sequence and associate it with an instance of E73 Information Object (or a specialization)
showcase the relation btw glyphs(/monographs…) and graphemes
explain the correspondences btw transcriptions from one type of writing system to another and the parts of written text that get transcribed

HW: AF, FM, PR, MD. TV to proofread

5 September 2022

Link to the HW by Achille, Francesca and Martin for the issue.

In the 54th CIDOC CRM & 47th FRBR/LRMoo SIG meeting, the SIG went through the HW by Achille, Francesca and Martin.

A summary of the proposal made can be found below. For scope notes and an overview of the proposed modelling constructs, see attached.

redraft TX5 Reading,
introduce new properties for TX5 Text Recognition: TXP10 deciphered text, TXPxx1 deciphered via the representation, TXPxx2 used copy or representation of, TXPxx3 recorded transcript
introduce new class TX6 Transliteration
introduce new class TX8 Grapheme
introduce new class TXxx1 Grapheme Occurrence
introduce new class TXxx2 Grapheme Sequence
introduce new class TXxx3 Script
redefine TXxxx Reading

Discussion Points:

overall emphasis on digital representations needs explaining: Digital renditions of a particular object have different identity conditions wrt. the original object or physical representations thereof.
TX8 Grapheme a case of E90 Symbolic Object instead of E55 Type? Why make the abstract symbol a specialization of E55 Type and not of E90 Symbolic Object (and assign it a type)?
- Characters correspond to universals wrt. the Script/Writing System they are part of. Characters used in an inscription are particular instances of that type.
- Character Occurrences are specialisations of E90 Symbolic Object, and the type they are assigned is TX8 Grapheme.
- Graphemes are defined in the context of a Writing System. Even for alphabetical writing systems that assume a more or less phonetic orthography, graphemes do not necessarily stand in a one-to-one correspondence with the phonemes they represent.
The possibility of making TXxxx Reading IsA TX5 Text Recognition and IsA Ixx Meaning Comprehension
Cross-referencing a model with not only CRMbase but other family models too, means that creating a stable version in that particular model is dependent on using stable versions of the relevant family models

Decisions:

The proposal was overall accepted. Some minor editing also took place.
- HW: AF & FM to provide examples for the classes and properties that lack them (TXPxx2 used copy or representation of, TXPxx3 recorded transcript, TXxx2 Grapheme Sequence). Assign numbers to the newly introduced classes and properties.
When the properties for Script and Reading have been admitted into the model, produce a stable version of CRMtex. That version will need to be harmonised with the newest stable/official version of CIDOC CRM. We do not need to harmonise compatible models with every newer stable version, but should do so every now and then (every few years).

Rome, September 2022

In the 55th joint meeting of the CIDOC CRM and ISO/TC46/SC4/WG9; 48th FRBR/LRMoo SIG meeting, the SIG reviewed HW undertaken by AF, FM, MD & PF. A summary of the decisions can be found below. For the details of the proposals/discussions/decisions see attached document.

Decisions:

Proposed definition of TXP16 employs script was approved
AF & FM are assigned HW to redraft the definitions of TXP7 has item, TXP17 has part, TXP18 read and send them out on an e-vote
deprecation of TXP3 rendered was approved
the examples for TXP14 used copy or representation of, TX15 recorded correspondence, TX12 Grapheme Sequence were approved
the proposal for the introduction of an FOL axiom (also rendered in prose) for TXP10 deciphered text and TXP14 used copy or representation was approved
CRMtex editors to revise the FOL axiom (also render it in prose) for TXP13 deciphered via representation and send it out for an e-vote.

Nb: no decision was reached wrt the number of the new release

Belval, December 2022

Current Proposal:

Post by Pavlos Fafalios (26 April 2023)

Dear all,

We (Achille, Francesca, Martin, Pavlos, Eleni) are preparing the new stable version of CRMtex, following the discussion and decisions during the last SIG meeting in Belval (see Issue 549).

For those interested, a depiction of the new full model is available here, and the full draft document of the new version is available here.

Please consider the following latest (minor) changes for an e-vote. Please vote YES if you are happy with the changes/inclusions, or vote NO and explain which ones are not good enough. For minor issues, please vote YES with a note for the correction.

============================

(A) Revision of scope note of 'TXP7 has item'

Domain: TX13 Script

Range: TX8 Grapheme

Revised scope note: This property associates an instance of TX13 Script with an instance of TX8 Grapheme employed by this script. Different instances of TX13 Script may have some graphemes in common, but not all.
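
The constraint in this scope note can be illustrated with a minimal sketch, assuming that TXP7 has item is held as a mapping from script to a set of grapheme labels. The script names, grapheme labels and both helper functions below are hypothetical and are not part of CRMtex.

```python
# Hypothetical data: TXP7 has item modelled as script -> set of grapheme labels.
TXP7_HAS_ITEM = {
    "script_A": {"g1", "g2", "g3"},
    "script_B": {"g2", "g3", "g4"},
}


def shared_graphemes(script_x, script_y, has_item=TXP7_HAS_ITEM):
    """Graphemes (TX8) employed by both scripts (TX13)."""
    return has_item[script_x] & has_item[script_y]


def are_distinct_scripts(script_x, script_y, has_item=TXP7_HAS_ITEM):
    """One reading of the scope note: distinct scripts never share all graphemes."""
    return has_item[script_x] != has_item[script_y]


print(sorted(shared_graphemes("script_A", "script_B")))
print(are_distinct_scripts("script_A", "script_B"))
```

Plain sets are used here only because the scope note speaks about graphemes held in common; a real implementation would more likely query the property in a triple store.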

(B) Revision of scope note AND new example of 'TXP17 has part'

Domain: TX12 Grapheme Sequence

Range: TX12 Grapheme Sequence

Revised scope note: This property associates an instance of TX12 Grapheme Sequence with another instance of TX12 Grapheme Sequence appearing at a particular position of the sequence. The property can be also used by an instance of TX11 Grapheme Occurrence (subclass of TX12 Grapheme Sequence) for denoting that a grapheme occurrence has part another grapheme occurrence. Note that a grapheme occurrence may be a symbolic composite containing another grapheme occurrence, such as the minute character “e” on top of the character “u” in former German writing systems denoting the symbol for “ü”.

New example:
- The recognition of the “DIVINITATIS” grapheme sequence (TX12), corresponding to the glyphs sequence of the inscription (TX1) on the Arch of Constantine, TXP17 has part the recognition of the “AT” grapheme sequence (TX12) which appears to be damaged.
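
As a rough illustration of the example above, the following sketch models a grapheme sequence as a plain string and records TXP17 has part together with the position of the part. The class name, the `damaged` flag and the 0-based offset convention are assumptions made for this sketch only; CRMtex does not prescribe them.

```python
from dataclasses import dataclass, field


@dataclass
class GraphemeSequence:
    """Stand-in for TX12 Grapheme Sequence; `graphemes` is a plain string here."""
    graphemes: str
    damaged: bool = False
    parts: list = field(default_factory=list)  # (offset, GraphemeSequence) pairs

    def add_part(self, offset, part):
        """Record TXP17 has part, checking that `part` really sits at `offset`."""
        end = offset + len(part.graphemes)
        if self.graphemes[offset:end] != part.graphemes:
            raise ValueError("part does not appear at the given position")
        self.parts.append((offset, part))


whole = GraphemeSequence("DIVINITATIS")
damaged_part = GraphemeSequence("AT", damaged=True)
# The offset is looked up in the string itself (0-based, hypothetical convention).
whole.add_part(whole.graphemes.find("AT"), damaged_part)
print(whole.parts[0][0], whole.parts[0][1].graphemes)
```

Because TX11 Grapheme Occurrence is a subclass of TX12 Grapheme Sequence, a single-grapheme occurrence could be represented by the same class with a one-character string.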

(C) Revision of scope note of 'TXP18 read'

Domain: TX14 Reading

Range: TX1 Written Text

Revised scope note: This property associates an instance of TX14 Reading with an instance of TX1 Written Text whose linguistic meaning was interpreted/understood through the reading process. It is a shortcut of the fully developed path from TX14 Reading through P9 consists of, TX5 Text Recognition, TXP10 deciphered text, to TX1 Written Text.
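
The shortcut can be sketched as a simple derivation over subject-predicate-object triples. This is a minimal illustration, assuming triples are plain Python tuples with the property labels as strings; the identifiers (`reading_1`, `recognition_1`, etc.) and the function name are hypothetical, and the sketch does not check the classes of the instances involved.

```python
def derive_txp18_read(triples):
    """Infer (reading, 'TXP18 read', text) from the fully developed path."""
    consists_of = {(s, o) for s, p, o in triples if p == "P9 consists of"}
    deciphered = {(s, o) for s, p, o in triples if p == "TXP10 deciphered text"}
    return {
        (reading, "TXP18 read", text)
        for reading, recognition in consists_of
        for recognition_2, text in deciphered
        if recognition == recognition_2
    }


triples = {
    ("reading_1", "P9 consists of", "recognition_1"),
    ("recognition_1", "TXP10 deciphered text", "written_text_1"),
    # This recognition is not part of any reading, so it yields no shortcut.
    ("recognition_2", "TXP10 deciphered text", "written_text_2"),
}
print(derive_txp18_read(triples))
```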

(D) New FOL statement of 'TXP13 deciphered via the representation'

TXP13(x, y) ⇒ (∃z) [TXP14(x, z) ∧ P138(y, z) ∧ ¬TXP10(x, z)]

// FOL explanation:

If the text recognition process deciphered via a representation of the original text (a visual item), then:

- there is a written text ‘z’ whose copy or representation was used in the text recognition process (TXP14),

- the visual item represents the written text (P138), and
- the text recognition process did not decipher the written text through an autoptic investigation.

Why to include this statement (text by Martin):

It says that the specific original has not been used in this process. The Closed World is limited to this particular activity, which is reasonable to be completely observed. The FOL is important because it clarifies the provenance of knowledge. Whatever was recognized in this particular process is to be found on this specific copy. We have added "if this particular process" in the scope note, to make clear that it is not a global negation.
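
The statement can be checked mechanically over a small, closed set of facts about one activity. The sketch below is only an illustration, assuming each property is held as a set of (domain, range) pairs; the function name and the instance identifiers are hypothetical.

```python
def txp13_violations(txp13, txp14, p138, txp10):
    """Return the TXP13 pairs (x, y) for which no witness z satisfies the axiom."""
    violations = set()
    for x, y in txp13:
        witnesses = {
            z
            for (x2, z) in txp14
            # z must be represented by y and must not have been deciphered directly.
            if x2 == x and (y, z) in p138 and (x, z) not in txp10
        }
        if not witnesses:
            violations.add((x, y))
    return violations


txp13 = {("recognition_1", "photo_1")}          # deciphered via the representation
txp14 = {("recognition_1", "inscription_1")}    # used copy or representation of
p138 = {("photo_1", "inscription_1")}           # represents

# Consistent: the original was not deciphered directly in this activity.
print(txp13_violations(txp13, txp14, p138, txp10=set()))
# Inconsistent: the same activity also claims TXP10 on the original.
print(txp13_violations(txp13, txp14, p138, txp10={("recognition_1", "inscription_1")}))
```

The negation is evaluated only against the facts recorded for this particular activity, in line with the limited closed world described in the explanation.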

(E) Revised scope note AND new example of TX3 Writing System

Revised scope note: This class represents a conventional symbolic system designed to represent units of a natural language with the purpose of recording and transmitting information. A writing system consists of a set of symbols (graphemes, TX8), instantiated through physical signs of a visual or tactile nature (TX9 Glyphs) representing linguistic units of any kind and the related syntactic (graphotactic) rules.

It is used to produce a TX1 Written Text during a TX2 Writing event.

New example: The Latin alphabet using the glyph “E” to encode the mid front unrounded vowel /e/.

(F) New version number

Based on the 3-tier numbering of versions that we had accepted in issue 310 and revised in Rome in the context of Issue 599, the numbering should reflect how drastic the changes implemented since the last release of the model are.

The new version will be 2.0 because it implements major changes in the model.

===========================================

Thank you!

Best regards,

Pavlos

Post by Gerald Hiebel (4 May 2023) [in response to the e-vote initiated by Pavlos Fafalios on April 26]

YES

And thank you to all of you, that is most relevant for our linguistic and historic projects!

Outcome:

In the 56th joint meeting of the CIDOC CRM and ISO/TC46/SC4/WG9 & 49th FRBR/LRMoo SIG, the SIG reviewed HW by the maintainers of CRMtex, which involved redrafting of scope notes, examples and FOL statements for the following set of classes/properties:

TXP7 has item
TXP17 has part
TXP18 read
TXP13 deciphered via the representation
TX3 Writing System

The proposed changes were accepted by the SIG. The redrafted class/property definitions can be found in the attached document.

Overall decision concerning the new release of CRMtex:

The new version will be released shortly after the SIG meeting, following some minor, editorial revisions. It will bear the number 2.0, as it implements major changes in the model (new properties, redrafting of existing ones, etc.)

Issue closed

Crete, May 2023.

Reference to Cidoc Version:

Version 7.1.2