Issue 490: How to model a file

ID: 490
Starting Date: 2020-04-15
Working Group: 3
Status: Open
Background: 

Posted by George on 15/4/2020

Dear all,

Here is another humble modelling problem for which I don't feel that there is a commonly agreed and documented answer, although it is a common question. How do we connect an actual file with the semantic network? So here is the scenario.

I have a file: a Word doc, a JPG image, a PowerPoint. I want to represent it in CIDOC CRM, connect it to the semantic network, and do so in a way that is interoperable with all other well-formed instances of CIDOC CRM. How do I do that?

Well, part of the answer is clear; part is unclear. Regarding the representation of the fact that there is a digital object, we have two choices. If we use pure CRMbase then we have

E73 p2 has type E55 "Digital Object"

If we use CRM extensions then we have

D1 Digital Object

Great. Now in the semantic network we can relate this in all sorts of standard ways to other entities (p67 refers to, p128 is about) etc. etc. We can use a creation event from CRM base or a digital machine event from CRMdig to document when the file was created, by whom etc. Super. I can use p1 is identified by E41 appellation to indicate the name of that digital object (which may differ from the file name) and give it a type with p2 has type. All standard and wonderful.

I still have to put the file itself, that actual digital object which I want my user to be able to find and manipulate somehow in relation to the semantic network. 

How do people tend to do that? I have seen many variations but no common method.

So what is the go-to solution and should it perhaps be documented on the CIDOC CRM site because it is a really common pattern?

I have seen

the file = E73... just put the file as the URN of the semantic node. But then this means your file is accessible via a URN which is often not the case and anyhow you probably want to distinguish your semantic node which 'stands for' the file from the actual file itself. 

I have seen and used E41 Appellation as a pattern. So the D1 or E73 p1 is identified by E41 Appellation p190 has symbolic content rdf:literal "file name value goes here". Here the problem is that you then also need to store, somehow, a path by which to reach that file on some file system.
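A minimal sketch of this Appellation pattern, written as plain subject-predicate-object tuples. The `ex:` identifiers and the file name are invented for illustration, and the `crm:` names simply mirror the class and property labels used above. It shows that the file name can be recovered from the node, while no path or location is recorded anywhere.

```python
# Sketch only: "ex:" identifiers and the file name are invented for illustration.
# The "crm:" names mirror the class and property labels used in the post.


def appellation_pattern(node, appellation, file_name):
    """Triples for: E73 --P1 is identified by--> E41 --P190--> literal."""
    return [
        (node, "rdf:type", "crm:E73_Information_Object"),
        (node, "crm:P2_has_type", "ex:type/digital_object"),
        (node, "crm:P1_is_identified_by", appellation),
        (appellation, "rdf:type", "crm:E41_Appellation"),
        (appellation, "crm:P190_has_symbolic_content", file_name),
    ]


def objects(triples, subject, predicate):
    """Return all objects stated for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


triples = appellation_pattern("ex:doc/1", "ex:doc/1/name", "report.docx")

# Follow node -> appellation -> literal to recover the file name.
names = [
    name
    for app in objects(triples, "ex:doc/1", "crm:P1_is_identified_by")
    for name in objects(triples, app, "crm:P190_has_symbolic_content")
]
print(names)
```

Nothing in these triples says where `report.docx` lives, which is the gap described above.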

I guess another alternative would be to use p190 has symbolic content and then throw the file in there as a blob. I don't particularly like this solution, as I would hope to find strings at the end of p190 and not blobs.

Would maybe a sub property of p190 'is encoded in file' be an option in order to use the blob solution? 

Anyhow, maybe there are already better solutions than the ones I lay out above, and I would be interested to hear them. Also, I think it would be great to identify the best practice and put it on the main site so that people follow this strategy consistently.

Probably my examples hide multiple use cases requiring different patterns. Anyhow, what do you think?

Current Proposal: 

Posted by Daria Hookk on 16/4/2020

Dear friends,

I am not sure I clearly understand the model of "Activity"; please check whether the situation is repeatable (as in reality):

step 1 - modeling

step 2 - decision

step 3 - return to the step 1 (re-modeling, improvements or renovation)

The same thing exists in the digital world as in reality and is even fixed in State standards (in Russia).

 

Posted by Martin on 16/4/2020

 

You are right, this is an open and important question.

I have repeatedly pointed to that issue in CRM-SIG Meetings, but with limited response so far.

One part of the discussion had been in CRMInf, about the equivalence of a file with a Proposition Set.

The other part arose when P190 has symbolic content was introduced. My remark that this must be discussed is not even in the minutes. Maybe I forgot a homework to elaborate on this. Let me expand a bit on my current understanding here:

This: E41 Appellation p190 has symbolic content rdf:literal "file name value goes here" feeds only the name of the file, which identifies the file, into the Appellation. It could be just an rdf label, because the content of the Appellation is hardly ambiguous. On the other hand, reserving another node for the Appellation allows for assigning a type "filename" to it. But a filename is in any case not a good identifier.

If the Digital Object is represented by a URI, e.g., a DOI, the remaining question is whether or not it resolves to, or can unambiguously be related to, external content.

If it does, then the identity of this Digital Object should be the "primitive" one, its binary identity. I.e., a .pdf and a .doc of the same scientific publication would be different objects; even a change in the embedded metadata of a .doc would make it a different object.

If we mean, however, that the ontological identity is, for instance, that of the equivalence class of possible encodings of one certain publication following Springer rules or so, the URI pointing to a binary is misleading, because many files can represent the same publication. The different encodings will both incorporate and represent the respective publication, but neither property identifies the content.
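The difference between binary identity and content identity can be sketched as follows. This is only an illustration under stated assumptions: the sample text is invented, SHA-256 stands in for "binary identity", and two text encodings stand in for the .pdf/.doc case.

```python
import hashlib

text = "Καλημέρα, κόσμε."  # invented sample text

# Two binary encodings of the same character sequence.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")


def binary_identity(data: bytes) -> str:
    """Bit-wise identity of a file, approximated here by a SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()


# The binaries differ, so as "primitive" digital objects they are distinct.
same_binary = binary_identity(as_utf8) == binary_identity(as_utf16)

# Decoded to characters, both carry the same symbolic content.
same_content = as_utf8.decode("utf-8") == as_utf16.decode("utf-16-le")

print(same_binary, same_content)  # False True
```

A URI that resolves to one of these binaries identifies that binary only, not the equivalence class both belong to.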

Therefore, a variation (not subproperty) of P190 should do it. We again have the problem that we need to form a common superclass with a Primitive Value.

Perhaps, once we have taken the great step and declared some Primitive Values as IsA Appellations, the most elegant form would be to form a superclass of E62 String and Digital Object, and raise the range of P190 to it. This would elegantly make clear that E62 String and Digital Object differ only in whether they are inside or outside the KB proper.

If we do that, the range of P190 will again point to a URI, which, in this case, must either be the binary or a lower-level representation than the level of symbolic specificity given for the domain instance. In any case, we should arrive at a "tangible" binary, and a suitable type may be useful to distinguish whether the URI is meant to correspond to a real binary (even if no longer extant!) or to a higher level.

We should also answer the question, how this translates to analogue content, because we may copy files manually and re-encode.

After that, we should think about Propositional Objects represented in files...

Any thoughts?

Posted by Martin on 16/4/2020

...in addition, we should think about digital images of printed material providing the identity of content for a lost physical item...

In the 47th joint meeting of the CIDOC CRM SIG and ISO/TC46/SC4/WG9; 40th FRBR - CIDOC CRM Harmonization meeting, the SIG decided to support MD's proposal to introduce a new property that describes the relation between URIs standing for content of different symbolic specificity. The issue is about the intuitions we have concerning the identity of a file we point to by a URI, and not about the actual identity of the file in IT terms. The IT identity would be binary, but we're thinking in terms of content, which is much more abstract. TV will add a paragraph about the implementation to the RDF document, and HW was assigned to MD & GB to work towards a new property definition, asking for advice from the library community.

June 2020

 

Posted by Martin on 16/10/2020

Here is my attempt at a property "P190 prime":

Pxxx has representative content

Domain:

E73 Information Object

Range:

E73 Information Object

Subproperty of:

E73 Information Object. P165i is incorporated in (incorporates): E73 Information Object

Quantification:

many to many (0,n:0,n)

Scope note:

This property associates an instance of E73 Information Object with a complete, identifying representation of its content in the form of another instance of E73 Information Object.

This property only applies to instances of E73 Information Object that can completely be represented by discrete symbols, in contrast to analogue information. The representing object may be more specific than the symbolic level defining the identity condition of the represented. This depends on the type of the information object represented. For instance, if a text has type "Modern Greek character and punctuation marks sequence", it may be represented in a formatted file with particular fonts, meaning however only the sequence of Greek letters. Any additional analogue elements contained in the representing object will not be regarded as part of the represented.

As another example, if the represented object has type "English words sequence", American English or British English spelling variants may be chosen to represent the English word "colour" without defining a different symbolic object.

In a knowledge base, typically, the represented object will appear as a URI without a corresponding file, whereas the representing one will appear by the URI of a binary encoded file existing outside the knowledge base proper, or even a paper edition.

Posted by George on 19/10/2020

Dear Martin,
The scope note reads nicely. 

Questions:

1) why domain and range E73. If this does not count for things analogue then shouldn't it be E72 to D1? Do we just not do this because it is a reach out from base to family model? Does this raise the problem again of the limits of models? I think it would be very good for CRM to have ready solutions for very standard problems of representation. This property seems to go in that direction, but I wonder if it then raises the question of if a D1 should move up into base?

2) would this be a sub property of p190 and if not, why not?

 

Posted by Martin on 19/10/2020

Hi George,

On 10/19/2020 10:52 AM, George Bruseker wrote:

Dear Martin,

The scope note reads nicely. 

 

Questions:

 

1) why domain and range E73. If this does not count for things analogue then shouldn't it be E72 to D1? Do we just not do this because it is a reach out from base to family model? Does this raise the problem again of the limits of models? I think it would be very good for CRM to have ready solutions for very standard problems of representation. This property seems to go in that direction, but I wonder if it then raises the question of if a D1 should move up into base?

I propose that it does count for things with analogue parts, it simply does not provide a content identity to the analogue parts, the typical illustrations and analogue layout. For instance, take the "Hobbit" in e-book edition without illustrations, and ask if the text is identical with the original, with that of a pdf version rendering the original images, and with that of a digital copy of the original.  It does count for paper editions. Therefore it is more general than P190.

Needs some good examples, but any scientific publication available in .doc and .pdf  does the job.

Posted by George on 19/10/2020

Dear Martin,

Ok so how does it perform the function of supporting how to model a file? Does p190 then become the property that exclusively handles blobs?

Posted by Martin on 19/10/2020

On 10/19/2020 7:56 PM, George Bruseker wrote:

Dear Martin,

 

Ok so how does it perform the function of supporting how to model a file? Does p190 then become the property that exclusively handles blobs?

Hi George,

I think I propose something that works for files as well. The question is, do we mean the binary identity of an E73, or another one?
P190 is for the identity of the inline content, the String value. Whether a blob is modelled as a String or as a copy of an imported file is another question.

I think I have to add another phrase that a binary E73 defined by a URI for a file has bit-wise identity (checksum), and nothing else. But this touches issues of embedded metadata that are not our business.

In the 50th joint meeting of the CIDOC CRM SIG and ISO/TC46/SC4/WG9; 43rd FRBR – CIDOC CRM Harmonization meeting, the SIG reviewed HW by MD (scope note of Pxxx has representative content).
The text was considered too abstract & dense and in need of redrafting. 

Decision: CEO, TV, OE to provide an alternative formulation that will be discussed at a later stage.

Details of the scope note and the discussion that followed MD's presenting the HW can be found here
June 2021
 

In the 53rd CIDOC CRM & 46th FRBRoo SIG meeting, CEO walked the Sig through the issue: 

  • The solution to GB’s original question is as follows: The abstract content of a file is an instance of E73 Information Object. The actual content of the file (sequence of symbols) can be represented by an instance of E62 String. The link to the abstract content could be given through a *has semantic encoding* property. 
  • MD had proposed introducing a property (Pxxx has representative content) to connect two instances of E73 Information Object. The scope note was discussed at length during the 52nd Sig meeting, it was not very clear and had not been accepted then. 
    The property proposed by MD, Pxxx has representative content (see here), can mark an equivalence relation between different instances of E73.
    This forms an additional solution, not what GB was asking. 
  • The new property expresses an equivalence relation between instances of E73 Information Object that are also instances of D1 Digital Object in CRMdig. 

HW: MD, TV, CEO 

  • To redraft the text of the scope note & provide examples that are not exclusively European-centric. 
  • To address the practical issue (question by GB), namely: How do I connect a file with a semantic network. Add something to the rdf guidelines. Distinguish among a file as a physical copy held by a digital library and referring to “my file” by a URI. 

 

May 2022

Post by Martin Doerr (16 September 2023)

Dear All,

Let me summarize the discussion about issue 490 between George, Christian-Emil and me, to be discussed in the next meeting:

"How to model a file" may be too vague.

There are three aspects:

A) What constructs are needed in the CRM ontologically to refer to the unique content of a file.

B) What constructs are needed to refer unambiguously to a resource that changes content. This is modeled in CRMpem as "Volatile Dataset", and will not be discussed in this issue.

C) How to connect in a knowledge base to a materialized content description.

About A):

We take a file (see also Persistent Dataset in CRMpem) in the sense of an immaterial E73 Information Object as a unique sequence of symbols that can be machine-encoded, regardless of which groups of bits constitute one of the symbols of interest in this object.
   in the KB: The intended identity can be represented by a URI.

We take a file in the sense of a material copy on a digital medium as a kind of "E25 Human-Made Feature", regardless of whether it is on a local installation, in a "cloud" cluster of machines, a LOCKSS federation of copies, or on a removable carrier.

    in the KB: We may refer to the material copy by an external URL, or create an E62 String in a KB or within an RDF file, or use a platform-internal "BLOB mechanism" with whatever kind of identifier the platform uses to refer to the local copy.

Ontologically, it is irrelevant for the intended immaterial content whether the copy is printed or scribbled on paper or stored on a digital medium (or even a Morse sound track), as long as the material form is unambiguous with respect to the intended content. Both paper and digital media can have errors. The CIDOC CRM v7.1 can be printed on paper and, in principle, be re-entered manually into a file loss-free.

   in the KB: We may refer to a paper copy or a removable medium by an archival identifier.

About C)

Using an archival identifier for a paper copy, a removable digital medium or a URL for a file on a machine, in all cases the maintainers of the archive must guarantee that the identifier will be uniquely connected with the content. Otherwise, using a URL in a KB is simply inadequate.
The DOI organisation foresees penalties for users that change the content of a URL associated with a DOI. There is no other solution.

DOI automatically redirects from the DOI URI to the guaranteed URL.

The property P190 has symbolic content is used to connect a machine-encodable information object to a KB-internal string. Similarly, we want to refer to the content of an information object via an external copy, digital or not, by means of a URL or archival identifier. Therefore we propose the following property:

*New proposal:*

Pxxx has representative copy

Domain: E90 Symbolic Object

Range: E25 Human-Made Feature

Subproperty of: E90 Symbolic Object. P128i is carried by (carries): E18 Physical Thing

Quantification: many to many (0,n:0,n)

Scope note:  This property associates an instance of E90 Symbolic Object with a complete, identifying representation of its content in the form of a sufficiently readable instance of E25 Human-Made Feature, including, in particular, representations on electronic media, regardless whether they reside internally in clusters of electronic machines, such as in so-called cloud services, or on removable media.

This property only applies to instances of E73 Information Object that can completely be represented by discrete symbols, in contrast to analogue information. The representing object may be more specific than the symbolic level defining the identity condition of the represented. This depends on the type of the information object represented. For instance, if a text has type "Sequence of Modern Greek characters and punctuation marks", it may be represented in a formatted file with particular fonts on a particular machine, meaning however only the sequence of Greek letters. Any additional analogue elements contained in the representing object will not be regarded as part of the represented.

As another example, if the represented object has type "English words sequence", American English or British English spelling variants may be chosen to represent the English word "colour" without defining a different symbolic object.

In a knowledge base, typically, the represented object will appear as a URI without a corresponding file, whereas the representing one may appear by the URL of a binary encoded file existing outside the knowledge base proper, or by the archival identifier of a paper edition. A URL for identifying the copy itself in a knowledge base should only be used as long as the providers support the persistence of that copy under this URL, as it is current practice for "Linked Open Data". Associating the referred copy with a checksum in the knowledge base may help safeguarding the maintainers against unexpected change of content under this URL. If more than one representative copy is referred to, the maintainers should control their mutual consistency at the symbolic level of the object intended to be represented.
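Checking mutual consistency "at the symbolic level" could look roughly like the sketch below. Everything in it is an assumption made for illustration: the spelling-variant table, the two level functions, and the sample sentences are invented, and a real vocabulary of symbolic levels is not defined here.

```python
import re

# Hypothetical table mapping spelling variants to one canonical form.
SPELLING_VARIANTS = {"color": "colour", "center": "centre"}


def as_english_word_sequence(text):
    """Reduce a copy to its sequence of English words, ignoring spelling variants."""
    words = re.findall(r"[a-z']+", text.lower())
    return tuple(SPELLING_VARIANTS.get(w, w) for w in words)


def as_character_sequence(text):
    """Reduce a copy to its exact sequence of characters."""
    return tuple(text)


def mutually_consistent(copies, level):
    """True if all copies coincide once reduced to the given symbolic level."""
    return len({level(copy) for copy in copies}) == 1


copies = ["The colour of the sky.", "The color of the sky."]

print(mutually_consistent(copies, as_english_word_sequence))  # True
print(mutually_consistent(copies, as_character_sequence))     # False
```

The same two copies are consistent for an object typed "English words sequence" and inconsistent for one typed as an exact character sequence, which is why the type of the represented object has to be recorded.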

Examples: 

-----------------------------------------------------------------------------------------------------------------------------------------------------

Note that the MS Word copy AND the pdf copy of the CRM are regarded as copies of an identical symbolic content, the one we are interested in!
The vocabulary of symbolic levels is still to be defined!

IF an instance of E73 Information Object is referred to in a KB via a (persistent) URL, I would regard this as a compression of URI - Pxxx - URL. This practice would not allow for the distinction between bitwise identity or higher symbolic form.

Partners of this homework please comment if I have missed something!

Best,

Martin
 

Post by Daria Hookk (17 September 2023)

Dear all,

Being out of the discussion (for too many reasons), I can add a metaphorical opinion. The poem about Gilgamesh is known from clay pieces, broken, copied and fixed with additions, and we are able to read it only with an interpreter. The same goes for any file: machine-readable data, which are fixed physically and accessed only by a computer interpreter and electricity. Be sure, most files, even printed, make no sense to a human. Are you interested in where the cleaning lady's rag is? Never! Although it's ethnography. Conclusion: files exist only for robots. Copies of video and sound as part of heritage are another thing, presenting only a specific part, but, be sure, useless a century after creation. Maybe we need to characterize them by duration of use?

With kind regards,
Daria Hookk

Post by Martin Doerr (25 September 2023)

Dear Daria,

Thank you for this message! I think we all agree with your comment, but sadly, people seem to forget that more and more, being blinded by the flood of digitally available information.

When I wrote:
"Using an archival identifier for a paper copy, a removable digital medium or a URL for a file on a machine, in all cases the maintainers of the archive must guarantee that the identifier will be uniquely connected with the content. Otherwise, using a URL in a KB is simply inadequate."

I did not want to open this discussion. For 500 years the disciples of Buddha relied on oral tradition by many people as being more reliable than the written form.

Here is an analysis we did many years ago, with an analytical model:

Petraki, M. (2005). Evaluating the reliability of system configurations for long term digital preservation.  (pdf).

But true preservation needs continuous control of whether the words are still understood (as in oral tradition).
This effort can only be spent on a limited set of things, so we need a selection of what we want to remember. This requires an understanding of our cultures; do we have it?

Isn't it?

Best,

Martin

In the 57th CIDOC CRM & 50th FRBR/LRMoo SIG Meeting, the SIG was presented with a summary of the discussions that took place within  the group tasked to work on issue 490. The ensuing proposal was to introduce a property whereby to refer to the content of an information object via an external copy, via a URL or archival identifier. 

For the details of the proposal and a summary of the discussion see here

Decision
The SIG voted in favor of adding the property, but assigned CEO, MD, and SdS to redraft its scope note to read as an observation, and also to come up with a less controversial label. 
HW: MD, CEO, SdS

Marseille, October 2023

Proposal for a new property Pxx has complete copy (is complete copy of) by Martin Doerr --pc. 6 February 2024

Pxxx has complete copy (is complete copy of)

Domain:            E90 Symbolic Object

Range:              E25 Human-Made Feature

Subproperty of: E90 Symbolic Object. P128i is carried by (carries): E18 Physical Thing

Quantification:     many to many (0,n:0,n)

Scope note:     

This property associates an instance of E90 Symbolic Object with a complete, identifying representation of its content in the form of a sufficiently readable instance of E25 Human-Made Feature carrying it, and no other content fitting to the type of the represented object. This property only applies to instances of E90 Symbolic Object that can completely be represented by discrete symbols, in contrast to analogue information. 

The kind of representing feature may be paper, but also, in particular, electronic media, regardless whether they reside internally in clusters of electronic machines, such as in so-called cloud services, or on removable media. The observation of the feature, be it by direct eye-sight reading or via a mechanical device transforming its content into a form comprehensible by human senses, provides the ultimate access to the content of the represented Symbolic Object. 

Which kinds of symbols and their arrangement on the representing feature are to be regarded as the exact content of the represented Symbolic Object will depend on the symbolic type of the Symbolic Object represented, which should be documented via the property P2 has type. The representing feature, being material, will typically be more specific than the symbolic form defining the identity condition for the represented.

For instance, if a text has been declared as an instance of E73 Information Object with P2 has type "Sequence of Modern Greek characters and punctuation marks", it may be represented in a formatted file with particular fonts on a particular machine, meaning however only the sequence of Greek letters. Any additional analogue elements contained in the representing object will not be regarded as part of the represented. As another example, if the represented object has type "English words sequence", American English or British English spelling variants may be chosen to represent the English word "colour" without defining a different symbolic object.

Consequently, more than one instance of E90 Symbolic Object, but of different symbolic type, may be declared Pxxx has complete copy the same instance of E25 Human-Made Feature. Further, all material copies may undergo deterioration. Since the identity of the represented symbolic object depends on the material content of one or more copies, extant or not, it is advisable to have access to multiple copies, in order to sort out corrupted copies or parts of them in time. Otherwise, the last existing copy or fragments of it constitute the last witness of its content.

In a knowledge base, typically, the represented object will appear as a URI without a corresponding file, whereas the representing one may appear by the URL of a binary encoded file existing outside the knowledge base proper, or by the archival identifier of a paper edition. A URL for identifying the copy itself in a knowledge base should only be used as long as the providers support the persistence of that copy under this URL, as it is current practice for "Linked Open Data". Associating the referred copy or copies with a checksum in the knowledge base may help safeguarding the maintainers against unexpected change of content under this URL.
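A minimal sketch of the checksum safeguard mentioned above. The record layout, the `ex:` URI, the example URL, and the choice of SHA-256 are all assumptions made for illustration; fetching the copy from its URL is left out, and the bytes are passed in directly.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyRecord:
    """Hypothetical KB record linking a symbolic object to one external copy."""
    represented_uri: str  # URI of the E90 Symbolic Object (no file behind it)
    copy_url: str         # URL of the copy, held outside the knowledge base
    sha256: str           # checksum recorded when the link was made


def register_copy(represented_uri: str, copy_url: str, content: bytes) -> CopyRecord:
    """Record the copy together with the checksum of its current content."""
    return CopyRecord(represented_uri, copy_url, hashlib.sha256(content).hexdigest())


def content_unchanged(record: CopyRecord, fetched: bytes) -> bool:
    """Compare freshly fetched bytes against the checksum stored in the KB."""
    return hashlib.sha256(fetched).hexdigest() == record.sha256


original = b"sample file content"
record = register_copy("ex:symbolic_object/1", "https://example.org/copy.pdf", original)

print(content_unchanged(record, original))               # True
print(content_unchanged(record, original + b" edited"))  # False
```

Such a check detects only bit-wise change under the URL; it says nothing about whether a changed copy still agrees with the represented object at its symbolic level.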

Examples

In first-order logic:
 

 

In the 58th CIDOC CRM SIG Meeting, seeing that the SIG had no time to review the HW that MD submitted, it was decided that SIG members would review the document and resolve it by evote.
 

Paris, March 2024

Post by Eleni Tsouloucha (15 April 2024) --Call for an evote: 

Dear all, 

During our last meeting, we didn't get to review Martin's HW for issue 490 (see here). It was decided that Martin's proposal to introduce property Pxxx has complete copy (is complete copy of) would be put on an evote instead.

 

Please indicate your agreement or provide feedback by April 30. 

 

All the best,