Unibas

List of Tables

1. Infosource-related tables

Unibas

Unibas is a relational database application that tries to address recurring common problems in all database applications, such as displaying actual and consistent data, maintaining consistency in the underlying DB, etc.

One part of Unibas is a document and media management system. Any type of media can be handled, but it is most useful for digital documents (files) on electronic media.

Unibas is also a database site of public data about literature, (music) CDs, (video) DVDs etc.; Unibas is a who-is-who of famous persons, a dictionary and a thesaurus.

With Unibas you can manage literature, music, videos and other documents; persons and addresses; companies, i.e. commercial companies as well as clubs and other societies people can be members of; physical objects such as pieces of furniture, but also books and CD; and much more. These entities can be in mutual relationships: physical objects such as CDs can contain copies of documents; a document has an author - a person; a person has an address and can be member of a company; and so on.

Unibas tries to be convenient by displaying related data simultaneously: a company along with its members; a CD and the copies of documents on it, etc. This can mean displaying two, three or even more entities to show certain aspects and relationships. For what is sometimes more important than the data of the entities themselves, is their relations. Knowledge is not knowing facts, but knowing how to use them, draw conclusions, relate them to other facts, overview a network of relations.

UniBas tries not only to store data appriopriately, but also make them conveniently accessible to the user. A document, for example, can be shown (a text) or played back (a piece of music; but Unibas uses the word "show" in all these cases). UniBas can extract some data from a document such as its type and author. Words can be looked up in the dictionary, and translated to words in a different language.

UniBas means Universal Database. Universal has several aspects: UniBas handles many types of entities; UniBas has many usages. Words are stored to look up their definitions, to translate them into different languages, to help people learning foreign languages, and other usages that are not yet known at the time of this writing. UniBas doesn't try to target a special task but to best represent the data. With a good representation and good general functions, it is easy to implement new functionality.

Technically, Unibas stores data in entity and relationship tables. These notions are used mainly as in database theory. The main entities are:
document
object
product
person
room
company
word
notion
node
and other less important entities that are not left out in this introduction, but discussed in detail in the main part.

A document represents the "soft" aspect of a book, for example; the contents. An object represents the "hard" aspect, the paper it is printed on. The "soft" and "hard" part are linked via "cp" table entries. All copies of a book together are represented by a product. Note that Unibas deals with single items (objects) as well as generalizations (products). A product represents all data the objects, e.g. all copies of a book, have in common. These are: publisher, publishing year, EAN/UPC/ISBN number, etc.

Work with Unibas is most convenient if a document is stored in a digital file. A (paper) book is not a file, not even its "soft" aspect is a file. But a book can be scanned and stored e.g. in a PDF file. From a list of scanned pages, Unibas can create such a PDF file (document context menu - Archive images as PDF).

Unibas, the program, is a free universal database application. Unibas, the organization and its Internet site Unibas.org, develops the program Unibas and collects all free documents and data. It cooperates with other free-content organizations like Wikipedia and Project Gutenberg.

Unibas is written in Python. It currently has two user interfaces: unibas_gui.py, a PyQt application, and unibas_cli.py, a command-line application used mostly for automatizing tasks. A web interface, unibas_web.py, is planned but not yet realized.

Maybe there also will be a unibas_kde.py, a version written with the KDE library that integrates seamlessly in the KDE desktop, follows the KDE GUI design guidelines, comes with translations (unibas_gui.py has none currently) and so on. However, this depends on good KDE developers willing to code in Python instead of the usual C++ used for KDE. While a KDE/C++ fork is always possible (Unibas being free software), there will be no support from the Unibas founder for anything except Python.

When reading a (especially scientific/technical) book, you notice that in most cases, it is organized in sections, subsections and so on, each section usually explaining one bigger or smaller topic. So you have an hierarchical structure on the sections of a book. Now collect all books on the branch of a science, e.g. on operating systems; collect all those collections, and so on, and you can imagine that (at least theoretically), you can put all human knowledge into such a hierarchical tree. The UniBas thesaurus is an attempt to do this.

Out there in the Internet, people are building the world's largest international, peer-reviewed, free encyclopedia. They call it Nupedia, and they have relatively strict quality guidelines.

Another free encyclopedia is Wikipedia, a collaborative project to produce a complete encyclopedia from scratch. They started in January 2001 and already have many articles. With few exceptions, anyone can edit any article--copyedit, expand an article, write a little, write a lot.

The Unibas thesaurus tries to give a structure to such collections. Alone, it is empty like a skeleton, but it has quotations of text attached to it. These quotations can be read online, or an article or a book can be created from all quotations of a topic.

Unibas is also suited to collect documents: texts, images, sound and other multimedia files. The user should be able to type the name of a composer or artist, and see a list of all his works, then select one, and the work is played back instantly.

Building up such a collection can be a lot of work, but UniBas' task is to do as much of it automatically. At this point, it works, but it is not yet suited for the average user. You need to know program and data structure to use it.

But it's an interesting program, and you can expect more to come soon.

The Unibas logo is a stylized "U" where the upper part is the outline of a database table and the lower, round part contains a globe. For optical reasons (a table outline is hardly visible in a small icon), the letters "SQL" can be used instead of the table outline.

Unibas has nothing to do with unibas.ch, the site of the university of Basel, unibas.de or unibas.com, a Korean commercial site. Unibas has the international nonprofit site, unibas.org.

Core Unibas features

Unibas is currently alpha. When all the core functionality below is implemented and works reliably, we can call Unibas beta.

"Nice-to-have" wishes for the future are marked "(wish)". They need not be implemented for beta status.

Data definition.

Conforming to PostgreSQL recommendation, use sequence IDs (id SERIAL PRIMARY KEY) throughout in entity and relationship tables, not implicit oids.
Framework for consistency and plausibility checks

Universal binary relationships (rel).

Dialog to define new rel (from notion, optionally creating new notion)
Optional context menu point for InstanceWidget to edit rel
Relnotion table independent from notion table. Relid references relnotion table, not notion table. Only relnotion.notionid references notion table.

Development and release management.

SVN for Unibas development, with Sourceforge. See http://sourceforge.net/apps/trac/sourceforge/wiki/Subversion.
installation procedure, packaging for Debian/Ubuntu
management of needed packages and external programs
backup method for DB
keep history list, last ca. 100 visited ids of each entity (wish)

Document and archiving.

handling of mounts of media
metadata: for a set of supported formats, extract metadata
import bibliography data e.g. from bibtex (wish)
document playback using ext. program depending on mimetype
album form for album, original content and versions (in other formats)
in album form, creation of copy of original album (e.g. audio CD) from contents in other formats, e.g. MP3s
album form: handling of multivolume albums
syntax for definition of documents and data wanted by a Unibas site
dialog to select documents (by title, author, etc.)
automatic exchange of documents and data between sites
define criteria for documents to be kept on active medium (wish)
subdirs for albums etc.; one more hierarchy level for multivolume albums (wish)
treat subdirs as big files when copying and calculating size, but as otherwise as individual documents
find file.ext and file_1.ext and check for content similarity (in case two almost identical files have been archived)

Automatizing tasks.

check if streamripper is running
harvest streamripper's ripped files
archive files in toarchive
copy documents to external medium if one is available
check external medium (calc and compare md5sums)
delete files with enough copies from active medium
get data from infosources
exchange data with other sites
archive photos and videos from dvgrab
systematic logging of cron runs

Infosources.

methods to find changes/additions without downloading everything again
individual polling interval, last polling date
method to find out if format has changed
history of prices
check regularly and get new data (e.g. new DVD issues)
importers for several dictionaries, DVD vendors, etc.
gathering news
complete definition of categories e.g. sciences, art, cooking... from dmoz or other sources
all files from Project Gutenberg and similar projects from other countries
From Wikipedia:
- get all composers, and all works by composer
- all musicians, actors, directors (from imdb) and their works; double-check across infosources
- authors and their works; link with Project Gutenberg and other sites

Knowledge tree.

possibility to select entity type and entity, and insert link to entity in text; maybe ask if node should be created for selected entity; if so, then point the link to that node and create a rel record linking that node with the entity.
process links before converting to HTML etc.
editor functionality in quotation.text field
import text and structure of documentation documents, e.g. manpages
semi-wysiwyg: edited text and docbook side-by-side
text-to-table: 2 or more spaces indicate new column
more tools for generating docbook
pretty-printing docbook source
gendoc and HTML preview
aids for notion editing: mark one or more words, select notion wizard, and the wizard layouts the words as emphasized, takes them into a new notion, the whole sentence as explanation, maybe the words also as word table entry, enters the current quotation id in notion, and maybe also calls a form for editing this meaning (word-notion).

Person.

address form
form letters, campaigns
family tree
portrait photos
Import vcard and other formats into person, company and address tables

Products, object and rooms.

front/back/cover photos (useful for re-creating CDs and DVDs)
re-creating CDs/DVDs, incl. printing on CD/DVD and inlet, metadata on DVD (outside VIDEO_TS)
get products and prices from infosources, to analyse price development

Archive maintenance.

automatic warning if a medium or recording seems to get older than expected lifetime
semi-automatic recopying: system asks user to insert medium with endangered documents, then the system does the rest
obsoletion date for documents: deleting files and forgetting metadata if desired

Company and members/employees.

form displaying company and members
form displaying company and children (org. structure)
founding date, liquidation date
legal type (Ltd., etc.)
job, parent company
jobs terminated (begin-end)

Dictionary.

Notion form for editing a notion and corresponding words; for the words their existing (other) meanings are also displayed.
handle common grmmatic transformations, irregular verbs
more human-readable presentation of a rel
(semi-) automatic input of words and meanings from infosources such as wiktionary.org

Future features

Advanced features: With the core features, Unibas is useful, but not yet really interesting. The following features should give Unibas an edge.

- manage user's annotations to documents
- automatic detection of infosources, starting from search machines
- HTML interface (unibas_web)
- radio/TV broadcast analyses: These are files with information about
radio/TV broadcasts aired in the past. They will be kept at least a few weeks
and contain exact timestamps (if possible, frame-exact) for
the following events:
- announcement of the broadcast short before it begins
- exact begin of broadcast, as opposed to announced begin
- begin and end of commercials and other interruptions
- overlaps due to commercials and other interruptions
(after a commercial often a few seconds of the movie are repeated
that have been aired before the commercial)
- exact end
- position of broadcaster's logo
Such files can be created maybe with an extension of mplayer or avidemux.
They can be applied to "raw" radio/TV recordings, i.e. recordings
containing some extra minutes before the broadcast, all commercials etc.
in-between, and extra minutes after, in case the broadcast was delayed.
The result of the application of the analysis is the movie or song
with all unwanted parts removed.
In the analysis, all extra parts are labeled (announcement, commercial,
overlap, etc.) and the user decides what should be removed
(maybe a user wants to keep announcements etc.).
Optionally, broadcaster's logo can also be removed or replaced
with user's own logo.
- audio interface for blind people
- smartphone interface
- RFID tags to find things
- diary and personalized letters; diary as autobiography in thesaurus?

More ambitious features: not yet sure if they can be realized:
- automatic translation of quotations, assisted by tags
- drawing logical conclusions from information in quotations

UniBas features (unfinished thoughts)

- user management; guest being default user with low privileges
- secure HTML interface: login with user and password; encryption
- users have certain privileges:
- users may request documents; if available in archive and they have
sufficient privileges, they can download documents directly;
if not available, but sufficient privileges, documents are placed
in archive, and user gets notification he can download it now
- users can get documents on physical media: CD-R, Zip, floppy, paper.
Note that some licenses allow distribution electronically but not
on paper. Physical media must usually be paid in advance.
One of the privileges is that user gets limited amount of media for free.

- viewers, like textdocumentwidget, for other formats such as:
HTML, PDF, info, .doc or kword, ...
- print complete tif-series document

- relate node table with dmoz.org

- explanations, summaries, longer text in general, may contain
docbook tags and links
- links, like in quotation, also allowed in notion explanation;
not only to nodes, but also to person, word, etc.
- links have the same form as in quotations; it may change, but now
is the docbook form: text
with e.g. "person" or "word" for entity and the ID for 1234,
and some descriptive text

- word contains etymology explanation with links
- leader and follower pattern: word "Hans" is a name and a version of
word "Johannes"; detailed etymology is explained for "Johannes",
and for all other versions, it should be looked up from there
- for given names and most family names, there needn't be a notion:
if you say "Johannes", you don't mean "Jahve is great" or
whatever was the original meaning which can now be found in
the etymology; no, you mean a person. There may be a more or less
famous person with this name, and then there is a notion for him.
But just to have the name and its etymology in the DB, there
needn't be a notion.

- words: viewer can lookup word cursor is on
- grammatical info is represented in a "grammar" field as
csv-list; first value being part-of-speech (pos), meaning of
following values depends on pos
- with some more intelligence, viewer can guess meaning of word in this
context (gives complete context, not only word, to lookup)
- word and context can be extracted to DB, if not already in it,
creating word, notion, meaning, and maybe word usage example
- grammar: from declined form (and context), guesses dictionary form
- vocabulary trainer with personal vocabulary and strengths

- synchronization of sound/video and text -
show subtitled video or highlight text in sync with audio

- given a music CD, UniBas can extract all data:
- contents as MP3
- data from FreeDB, MusicBrainz or similar projects
- if composer/interprets etc. are known and recognized, creates links

- representation of CDDB data even without documents being available,
i.e. object (medium) prototypes (and document prototypes?)
- store booklet and cover images (as documents related to object prototypes?)
- able to reconstruct complete audio CD from MP3, with cover and
(optionally) booklet

- if archive is sufficiently full and there are enough documents
not saved to removable medium (CD), UniBas proposes to create one,
and if agreed, creates it; includes a readme file with CD contents
and a CD label; optionally jewel case cover and booklet with summaries

groups of instances, e.g. person groups for form letters, playlists from document groups etc.

In texts such as quotations, there can be references to IDs
(e.g. node IDs). It takes too long to search all texts for an ID
if it needs to be changed (in changeID).
Therefore, an index table is created and maintained.
It is updated every time a quotation is stored or changed.
table: refindex
fields:
refererid: ID of the quotation
entityid: ID of the entity (not EE), e.g. 9 for node, 5 for dcument
id: ID of the refered EE (e.g. node ID)

If other entities (and fields) can refer as well, then
we need also the fields:
refererentityid
refererfield

Unification
records themselves:
-text:
- concatenate
- take longer one
- take non-null one
- dates: take more accurate one
- selection:
- take non-null one
- foreign ID:
- if both given and realistic, manual decision
dependencies: update, set to remaining ID

Deletion (how far can Postgres help?)
1. produce list of dependencies
2. warn, ask
3. if OK, delete dependencies
4. delete EE itself

Universal attribute-value per entity

Unfinished idea.

CREATE TABLE at (  name   text, attrval text[][] );
insert into at values ( 'photo1', '{{"place","krummenau"},{"person","julian"}}');
insert into at values ( 'photo2', '{{"place","paulshof"},{"person","annika"}}');
select attrval[2][1], attrval[2][2], * from at where attrval[2][1]='person' and attrval[2][2]='julian';

Ideas for universal attribute/value per entity:
Table "attribute", fields:
entityid  id1 id2   Attribute   value  
example:
entityid=5 for document
id=... for desired document(s), e.g. photos (all >=id1 and <=id2)
attribute='place'
value='Paris'

Note that for standard attributes, a standard solution
should be found, so that user needn't invent his own.

Installation

Unibas uses a lot of external programs. To install these and Unibas itself, you need root privileges. Do as root:

# apt-get install python-setuptools

This will install the python-setuptools which are needed to install Python eggs. Then install the Unibas egg using easy_install from python-setuptools:

# easy_install dist/Unibas*.egg

After that, Unibas is installed but not yet initialized. To do that, call:

# python unibas_init.py

if unibas_init.py was installed in the path by easy_install. The output of easy_install tells you where the Unibas start scripts were copied. unibas_init.py loads required packages for Unibas. You may have to insert a CD/DVD or otherwise make the packages available, for example via network. See the documentation of your distribution (Unibas was tested on Debian/Ubuntu) for more about package management and installation. unibas_init.py just uses the distibution's package management (apt-get).

unibas_init.py also installs and tries to configure PostgreSQL. If it was successful, you should immediately have access to the PostgreSQL server, e.g. using: psql template1 but this test is not necessary.

Now you should be able to start Unibas like this: unibas_gui.py At first, Unibas will report that there are no valid settings. As usual, you are safe with the defaults, but you can enter your own settings, if you know what you do. So Unibas will ask you to enter settings. Press OK to get to the configuration dialog. In every configuration tab (archive, database, logging), it is recommended to first click on Defaults, then modify or leave the defaults. Click OK to accept. If the DB connection you set up does not exist yet, unibas proposes to create a DB. Similar with the archive: If there is no archive directory, select one to create, e.g. /home/yourname/unibas-archive/ Unibas creates the directory for you. After settings are completed, the main unibas screen appears.

Add the option --record /tmp/rec to easy_install in order to store the messages at installation into the file /tmp/rec. Needs package python-setuptools.

Unibas needs a lot of other programs to run. We try to fulfill these requirements automatically, but this may fail sometimes. Users with Linux experience can, however, solve any problems with the following instructions.

Unibas needs the following other Ubuntu packages:

python-kde4
python-qt4
libqt4-sql-psql
python-qt4-sql
python-qt4-common
python-qt4
python-setuptools: to install python "eggs", if package is an egg
python-eyed3: for ID3 tags of MP3 files
python-beautifulsoup: to read HTML
postgresql-client-8.3: psql, recommended for debugging
postgresql-plpython-8.3
postgresql-client-common
dvgrab: to grab from digital video cameras
dvdbackup: to backup DVDs from DVD wizard

External module to install in /usr/local/lib/python2.5/site-packages/:

rssparser.py

Also, Unibas needs a PostgreSQL server. This can be on the local machine or on a different machine. The user - let's call him unibasuser - needs appropriate rights on that server. Usually these rights are granted by doing as postgres user on the server machine:

createuser unibasuser

and it is useful to let the new role be a superuser.

To access the DB server from the client, add to /etc/postgresql/8.3/main/pg_hba.conf a line such as:

host    all         volker         127.0.0.0/16          trust

and in /etc/postgresql/8.3/main/postgresql.conf change the listen_addresses line to:

listen_addresses = '*'          # what IP address(es) to listen on;

Restart the database server to make these modifications effective:

/etc/init.d/postgresql-8.3 restart

The Unibas installation procedure will try to make these settings automatically for you if the database server is the local machine. Otherwise, you will have to do them manually on the remote server.

Note

When testing the database access, keep in mind that Unibas uses the "host" connection type, not the "local" one, even if the server happens to be on the local host. That means, if you test the connection with psql, use: psql -h unibashost unibas and not just psql unibas For "local" connections, a pg_hba.conf line beginning with "local" applies; for "host" connections, a line beginning with "host" applies. You might be able to connect with the "local" type, but not with the "host" type. So you might end up wondering why Unibas can't connect while you (with psql using the "local" connection type) can.

Tutorials

These are some tutorials about how to easily use Unibas. They describe everyday actions made easy by Unibas, mostly in a "howto" manner.

How to archive all photos of an event

Often one makes a trip and shoots a lot of photos. These photos then need to be archived. The trip can be told an event (with location, start and end time) and the photos can be associated with this event.

To do so, select Form->Document Query Form, then in the form that opens: File->New documents from files. Select the photo files (you can select more than one). Now all selected photos should be archived, but no event is yet associated with them. If the corresponding event does not exist yet or has no date yet, copy the date of the first photo (to paste it later).

In the list, select the first photo and in the document widget, select Rel editor from the context menu. In the Rel editor dialog, click on New to create a new rel. As rel notion, select "show". As right-side entity type, select "event". Then click Find to find or create the event. In the "Find event" dialog, in the search expression line, type a '+' followed by a keyword that should occur in the event's name. Then click Find. If nothing is found or none of the found events is the right one, you must create a new one. To create a new event, click Add new. If you had typed +keyword as recommended before, that keyword will be provided as default for the name of the event. Take it as it is or edit it (of course, you can edit it later again). Select the new event and click OK. Back in the rel editor dialog, you can edit the new event by clicking Show. For example, you can paste the date of the first photo in the starttime field.

Don't forget to save the new rel by clicking Save. Then Quit. Now only the first photo has the event associated with it. To avoid going through this procedure with all other photos, there is a function in the context menu of the list: Propagate rel to rest of list. With the first photo in the list selected, call this function. In the event selector that opens, click OK. Immediately the other photos will be associated with the event, too. In the display nothing will have changed. To see the change, select a different photo and call Rel editor for it.

For more details see The Document Query Form.

How to archive an audio CD with Album Form

A typical task for the Album Form is to archive an audio CD. This describes the archiving of a single CD, for packages of two or more CDs see below.

Insert the CD in the drive. Start Unibas, Form->Album Form, then in the Album Form that opens: File->Check audio CD as you already know the type of CD.

Unibas tries to identify the CD. Let's assume here it is new, i.e. has no product and object entry yet. So in the Find Product dialog, Unibas will propose (by making default) the button Add new.

Hit Enter to activate it (which is the same as clicking Add new).
You will be asked for the title. If a reasonable default is proposed, accept it, otherwise edit it. Hit Enter.
Now you have created a new product, selected by default, and the OK button is marked default. Hit Enter again (which is the same as clicking OK).

Now we have a product associated with this CD. Same procedure for the object:

As it is a new object, hit Enter to add a new entry.
Hit Enter to accept the name (or edit it before).
Hit Enter to accept.

Copies will be scheduled in the default format (e.g. MP3). Confirm their execution.

Ripping and conversion to the target format takes some time. After that, you can right-click on a track and select Show digital version to check the copy.

Then Tools->Eject and clear to finish this CD and open the drive for the next.

This short Howto assumed that everything went well and the album title and track titles could be read from the FreeDB. If this is not the case or other complications arise, see The Album Form.

How to archive a pack of audio CDs with Album Form

After you know how to archive a single CD, this tells you how to archive a pack of audio CDs.

First we register the package as a whole. Again, select Form->Album Form, then in the Album Form that opens: File->Add package. Enter title and the other data. Product form will be opened. Check data, edit if necessary, and close it.

That's for the package around the CDs. Now Unibas knows that you want to enter a number of CDs. Insert CD 1 in drive. File->Check audio CD. Unibas supposes this is CD 1. Confirm that. The following is the same as with single CDs:

In product finder: Enter to Add New, Enter to confirm title, Enter on OK.
In object finder: Enter to Add New, Enter to confirm title, Enter on OK.
Then hit Enter to start copies.

Afterwards you can check that this product has the right packageid (select the Product tab).

Then Tools->Eject and clear to finish this CD and open the drive for the next. The next one is expected to be CD no. 2 by Unibas, and so on until all CDs of this package are finished. If you enter another CD after that, Unibas expects it to be a single CD again.

Conventions used in UniBas documentation

Specifications about the tables' structures are written in nodes called "Specifications for the tablename table". Fields are given in an ordered list.

Terminology in Unibas

In old program comments sometimes the terminology is not consistent, but in the future we try to stick to the following:

entity: An individual, distinguishable object such as a person or an object. Sometimes also called entity instance.
entity type: Type or class of similar entities. Sometimes also simply called entity, but that's incorrect strictly speaking. Stored in an entity table.

Unibas tables and the data they hold

This section details the data in Unibas. As usual with documentation, this can be outdated. Comments in the program and the database may be more up-to-date. However, there should be no fundamental contradiction between the program resp. database and this documentation. There should be only differences in detail, if any. For example, the fact that a document is part of another may be represented by an entry in the rel table, or by a special field "partof" in the document table, that holds the id of the parent document. As long as this is not finally decided, the relation between part and parent can be read from a view, e.g.: select id,id2 as parentid from document, rel where rel.id1=id and rel.ent1=5 and rel.ent2=5 and relid=44153; or select id,partof as parentid from document where partof is not null; Data about the tables and their relationships are listed in metadata tables. For example, comments and other metadata about the columns are listed in the column_metadata table. Metadata tables are very important technical tables. Think twice if you change anything in these tables. Many parts of the program read these tables and changes here can cause malfunctions.

Metadata tables and views

Metadata tables and views hold data about the data, i.e. metadata.

Entity tables are listed in the "entity" table. Comments and other metadata about the columns are listed in the column_metadata table. The tablelinks table holds data about relationships between the tables.

The tab_col view lists tables and their columns. It gets these data from Postgres system tables, in contrast to the column_metadata table which holds its data separately.

column_metadata.lines is the number of lines in a QTextWidget recommended for the field. It is usually 1, but for fields like document.summary and node.text it is more. column_metadata.comment is a human-readable comment useful as tool tip text.

tablelinks holds info about foreign keys. Maybe that info can also be taken from pga_ tables, but here it is explicit.

Tablelinks has the fields mastertable, subtable, subfield and an entry in tablelinks has the meaning: Field subfield in table subtable refers to field id in table mastertable.

If in a table t1, there is a field named t2id, and there is another entity table named t2, then usually t2's id is a foreign key for the field t2id. Most other fields ending with id also refer to foreign keys.

This SQL statement finds fields possibly referring to foreign keys: SELECT * FROM (SELECT *, position('id' IN col) as pos FROM tab_col) AS r LEFT OUTER JOIN tablelinks ON subfield=col WHERE pos>1 AND subfield IS NULL; TODO: A program checks if there exists a tablelinks entry with mastertable=t2 and subtable=t1 and subfield=t2id. If not, it suggests to create such an entry and if the user accepts, it creates the entry.

OLD: Often it is necessary to find out from where an entity element, e.g. a node, is referenced, e.g. if you want to delete it. For this, Unibas needs to know in which tables there are fields that refer to a given field in a given table. This information is explicit in meta_link and the view meta_table which uses Postgres system tables (hence it is highly Postgres specific):

select r.relname, f.attname, f.attnum, f.atttypid, t.typname from pg_type t, pg_class r, pg_attribute f, pg_namespace nsp where r.oid=f.attrelid and f.attnum>0 and f.atttypid=t.oid and not f.attisdropped and nsp.oid=r.relnamespace and nsp.nspname='public' order by f.attnum;

The entity table

The "entity" table is a very important technical table. It lists all entities and thei properties. New entities should be listed here, but think twice if you change anything about the existing tables. Many parts of the program read this table and its change can cause malfunctions.

The "entity" table contains:

id: entity id name: unique entity name, also programmatically used abbr: abbreviation if name is too long, hardly used descr: detailed description of entity, possibly multiline, not machine-readable tostring: expression that yields a short description, human-readable, as precise as possible, but only one line, without consulting related data from other tables detail: expression that yields a longer description, human-readable, detailed, possibly multi-line, if necessary also by consulting related data from other tables mainfield: name of a single field that best characterizes the instance, e.g. "famname" for person, "descr" for object shared, min_id and max_id are obsolete; they were used experimentally for inter-site communication but will be replaced</para>

Some entities can have a public id field. Example: the ean field of the product table. This is a field, usually of type text, which is accepted universally (across databases) as an id of a given instance. For persons you can think of a social security number, for places of a postal code combined with country code (or the geograpical latitude/longitude?), for mimetypes the type field, for companies some sort of fiscal code, etc.

This field's name is in stored in the entity table's publicid field.

Persons

Whenever in Unibas an author, composer, actor, director etc. needs to be given, it is an instance of the person table. A person is not given by name but by person ID. On music CDs, vinyl disc labels or in MP3 files, a composer may be given as "Bach", "J S Bach", "Johann Sebastian Bach" or whatever. It is up to the reader to figure out that they mean the same. Especially annoying is the name "Bach" given alone. There are many composers named Bach, and it is usually expected that the user knows that with "Bach" alone, the most famous Bach, namely Johann Sebastian Bach, is meant. it Unibas tries hard to eliminate such ambiguousnesses.

Persons as family members are linked together by the marriage table. It has, among others, the fields husbandid and wifeid. Hence it strongly looks like a relationship. However, it also has (and needs) an id so children's records can refer to the marriage record of their parents. This makes it more look like an entity table. Peter Pin-Shan Chen, a relational database expert, writes in "The Entity-Relationship Model - Toward a Unified View of Data": "It is possible that some people may view something (e.g. marriage) as an entity while other people may view it as a relationship. We think that this is a decision which has to be made by the enterprise administrator. He should define what are entities and what are relationships so that the distinction is suitable for his environment."

Specifications for person table

marriageid (integer): ID of marriage record of parents
givname (text): given name(s), incl. middle initial if there is, but without nobility particle
famname (text): family name(s), used to find person in dictionaries etc., so without nobility particle, e.g. "Beethoven" and not "van Beethoven"
fm (varchar 2):
sex (varchar 1): "m" for male, "f" for female
language (text): mother language, e.g. "en" for English
title (text): title like "Ph.D.", to be put before the name
remark (text): for human use
born (date): date of birth
died (date): date of death; NULL means unknown, 'infinity'::timestamp means still alive
marriageid (integer): ID of marriage record of parents
marriageid (integer): ID of marriage record of parents
lan (text, not null): language code, e.g. "en" for English, "de" for German,

To select all persons that died since 1970:

select givname, famname from person where died>'1970-1-1'::timestamp and died < 'infinity'::timestamp;

To select all persons that are still alive:

select givname, famname from person where died = 'infinity'::timestamp;

update person set died = 'infinity'::timestamp where died is null and

Fictional characters

A character is "any person, persona, identity, or entity that exists in a work of art." (Wikipedia). A character may occur in several works (for example Tarzan in the book "Tarzan", several films of it, ...). If it is the same character, it has the same DB entry (and hence the same id) in Unibas.

Distinguish character from real persons. A character may correspond to a real person. For example, John F. Kennedy is a character in the 1983 film "Kennedy" but also a real person. In this case, the personid field contains the id of the real person.

Objects, products and rooms

Objects in Unibas are physical objects that are kept somewhere in rooms.

With industrial mass production producing nearly identical objects, the common properties of these are stored in a record of the product table.

With the object and room tables, you can record where you keep your stuff. Hence they are mainly for personal use. The product table, however, is of interest to many people, all who bought an object of that product or are considering to do so.

Physical objects

With the object and room tables, you can record where you keep your stuff, and what things you have.

An object is kept in the room given in its roomid field. A room is a compartment of an object, given in the room table's objectid field. This way you can build a hierarchy of objects and rooms.

The f_rooms(objectid) function lists the complete hierarchy of rooms and objects the given object is in. id roomid: room this object is regularly stored in ot: object type (abbreviations, see below) descr symbol productid orig for media (MCs, CDs, DVDs): true if original (prerecorded) dt dimension type, e.g. 'CD' for CDs, DVDs and BluRays purchase date of purchase (not of production and not of recording The purchase date allows to track the ages of media. It is not the date of production (which is earlier, but unknown) and not the date of recording (which is, in case of recordable media, later, and each document's recording date may be different). Recording dates are stored in cp.date. ot abbreviations:

VC	VHS video cassette
BK	book
FD	floppy disk
Kr	cardboard box
Dia	slide
MC	music cassette
Rg	shelf
Or	file (of paper documents)
Mö	piece of furniture
DVD	DVD (usually pre-recorded)
Bx	box
Cb	cable
CD	CD, usually pre-recorded audio CD
CD-ROM	pre-recorded data/software CD, sometimes simply called CD
CD-R	once recordable CD
CD-RW	rewritable CD
Dc	(floppy) disk cluster
DP 	(floppy) disk pack
DVD	prerecorded video DVD
DVD-R
DVD+R
DVD-RW
DVD+RW
DVD-ROM		prerecorded data DVD, e.g. Knoppix
FDD	floppy disk drive
Fr	bicycle
HD	harddisk (internal or external, e.g. USB)
Mn 	manual
Od	old paper document
PC	complete PC
Q8	QIC 8 data backup tape (old)
V8	analog video cassette
VCD	CD (-R or -RW) with video in VCD format
ZIP	IOMega Zip disk
Zs	paper magazine

Specifications for object table

Some objects expire after a given date, i.e. they become unusable or at least it is not sure that they will still work after that date. This date usually is not know exactly. However, th purchase date is known and the life span of an object type is known approximatively, so the expiration date can be calculated. For example, self-burned CDs and DVDs can be used 3 to 5 years, after that they become unreliable. For media this means: The copies of all documents on expired media should not be relied upon.

id: ID
roomid: ID of room this object is or belongs in
ot: type
descr: description
symbol: for real objects: represents object in library uniquely
productid: ID of product this is a copy of, whether original mass-produced or handmade; if handmade, the field "orig" should be set to false
orig: true if this is an original, factory-produced copy
dt: dimension type: this object may be stored in rooms of this dt
purchase: date of purchase
virtual: true if we only know (e.g. from freedb.org) that objects of this type exist; we have no real copy, and hence the roomid field is irrelevant; makes sense for copies of products

(TODO) Currently the rt field is proposed only, not implemented. Values the ot field can contain:

bk: book
CD: Compact Disk, where exact type is unknown, may be CD-DA, CD-ROM, CD-R, etc.
CD-DA: Prerecorded audio CD (officially Compact Disc Digital Audio)
CD-ROM: pre-pressed read-only compact disc
CD-RW: Compact Disc ReWritable (CD-RW) rewritable CD
DVD: Digital Versatile Disc, where exact subtype is unknown, may be DVD-Video, DVD+/-R[W]
DVD-Video: pre-pressed consumer video DVD
DVD-R: single-layer recordable (write-once) DVD, 4.7 GB
DVD+R: single-layer write-once DVD, 4.7 GB, It is similar to, but incompatible with, DVD-R
DVD-R DL: dual layer recordable (write-once) DVD, also called DVD-R9, 8.5 GB
DVD-RW: rewritable optical disc with equal storage capacity to a DVD-R, 4.7 GB
DVD+RW: rewritable optical disc similar to, but incompatible with, DVD-RW
DVD-ROM: pre-pressed read-only data DVD
DVD-RAM: DVD–Random Access Memory, rewritable DVD format with built-in error control and a defect management, better suited for frequent rewriting than DVD+-RW
Un-CD: Audio CD with copy protection making it incompatible with the CD-DA standard
Un-DVD: Video DVD with copy protection like "X Protect" making it unplayable in many DVD players

For Un-CD: See http://www.sprut.de/misc/uncd/uncd.htm For Un-DVD: See http://forum.cinefacts.de/178877-bestaetigte-dvds-mit-x-protect.html

For prerecorded media, orig=true; for copied media, orig=false. Identical or almost identical copies of prerecorded media can be marked with a "copy" rel entry (object is copy of object). For an 1:1 copy of a prerecorded audio CD, ot='CD-R' or 'CD-RW', orig=false.

Objects of type CD, CD-R, CD-RW, DVD, DVD-R, DVD+R, DVD-RW, DVD+RW etc. all have the same physical dimensions and therefore have te same "dimension type" dt='CD'.

Object vs. product

Object table entries represent physical objects. Two copies of the same CD are two different objects. Object entries are bad for representing mass-produced CDs, where only a type of nearly identical physical objects is wanted. There is also a local/global difference: object entries are for objects I know they exist, and ideally I can give each object a number. So I must have physical access to it. For products, I only need to have access to one copy in order to know all about every other copy. So I can share information with others and get info from others. (E.g. freedb.org).

If only the product is known and no concrete object of it, a virtual object can be created. This is an object entry with the virtual field set to true.

Albums

Album form is the best means for managing albums like audio CDs and video DVDs. It provides overview of the contents of an album and its copies. While the most interesting album type is the audio CD, other types also qualify, if they contain several analog or semi-digital documents. These types include:
Video DVDs
Music cassettes (MCs)
LPs
Singles (ot='SP')
VHS cassettes (ot='VC')
With semi-digital we mean that the contents are digital but not organized in a file system and hence cannot be read by normal file operations, or something else (e.g. CSS for DVDs) prevents normal reading, so a ripping process must be used.

Operations include:
Ripping (i.e., this is a frontend of a ripper call)
Tagging (i.e., this is a frontend of a id3 tagger call)
Showing and manipulating relations (version of, part of, artist etc.)
Retrieving information from infosources e.g. FreeDB, MusicBrainz etc.
Re-creating an album from the DB and (e.g. MP3) documents
Playback of audio and video documents for verification

For the database, an album is an object (medium) with copies of (maybe analog or semi-digital) documents. Of these documents, there may be fully digital versions in several supported formats.

For example, the album can be an audio CD, the documents are the tracks, and the digital versions of the tracks are MP3 files. Other audio formats are Ogg, Wav and flac.

The form displays a menu bar, object details and a tabbed window. The user can select the following tabs:
Documents
Product
Owner
Details
The documents tab shows all documents (for audio CDs: tracks) of the album. Each line is a document. For each supported format, there is a corresponding column. The column is checked if there is a copy of the document in this format.


Supported video formats are:
DVD ISO
xvid

Book formats:
scanned tif
scanned jpg
scanned pdf
OCRed text
proofread text


Albums

UBAlbumForm usage:
in object: add new
file->rip cd audio; start
archive ripped
restart
read cddb
mp3tag


front cover is an image (e.g. mimetypeid 57, jpg),
separate document, related by notion 63814 to main document.
back cover: 63815
in album form, additional tab for front and back cover images

images related to album object or product, or to 
document (especially in single-document albums such as DVDs)
images used in mythTV

Creating audio CD from MP files

Creating audio CD from MP3 files

# cd to dir with mp3s named 01*.mp3, 02*.mp3 etc.
mkdir CD
for i in `ls -1 *.mp3 | cut -c1-2`; do track=`ls $i*.mp3`; wav=${i}.wav; echo $i $track $wav; mpg321 -w CD/$wav "$track"; done
cd CD
wodim -v -pad speed=1 gracetime=2 dev=/dev/dvd -audio *.wav
cd ..
cd ..


# concatenated:
mkdir CD; for i in `ls -1 *.mp3 | cut -c1-2`; do track=`ls $i*.mp3`; wav=${i}.wav; echo $i $track $wav; mpg321 -w CD/$wav "$track"; done; cd CD; wodim -v -pad speed=2 gracetime=2 dev=/dev/dvd -audio *.wav; eject; cd ..; cd ..


Error: Svald:
wodim: Inappropriate audio coding in '03.wav'.
Error: Grace Valhalla:
wodim: Inappropriate audio coding in '05.wav'.
Error: TAT:
wodim: Inappropriate audio coding in '09.wav'.

Using AlbumForm

Using AlbumForm to archive an audio CD:

Call Form - Album form
Insert CD

rippedFilesArchiver

rippedFilesArchiver (obsolete?)

This class wraps the functionality of archiving, 
i.e. integrating into the DB, files ripped using UBCDRipperDialog
or some external program. Using external programs might be necessary
if their function is not easy to integrate into Unibas,
as in the Windows-only program DVDShrink.

rippedFilesArchiver generalizes the function archiveRippedTracks
which was used to archive tracks of a CD-Audio only.
In archiveRippedTracks, some things were hard-wired which 
are now variable and under control of the user:

The type (format, mimetype) was MP3. In rippedFilesArchiver,
many supported types are possible.

The filename was tracknn.mp3, where 01 < nn < number of tracks.

The documents were created upon archiving, with the titles
named "Track nn". Now it should be possible to 
name documents after the album's name, or to add copies (cp entries)
to preexisting document DB entries. These could come from external
databases such as FreeDB.

Of course, rippedFilesArchiver also features user feedback:
overall progress, copy progress (using rsync --progress), 
md5sum progress (using md5deep).

rippedFilesArchiver tries to provide as complete data
as possible. So it tries to determine e.g. playlengths from the ripped files.

rippedFilesArchiver tries to estimate the total execution time.
If something goes wrong, the state will be as before;
files copied till the error happened, will be deleted and
the DB makes a rollback.

The user can decide whether files are copied or moved (i.e. removed from their
original places, if all went well).

The order of the files can be set up: alphabetical or numeric,
whole filename or some part of it.

rippedFilesArchiver was designed with all album types in mind, not only 
audio CDs. 

Sometimes document titles can be derived from filenames.
A regexp can be used to extract the title from the filename.
For example, a jamendo.com mp3 filename might look like:
10 - In Your Diary.mp3

Also document for whole?
E.g. Bibi und Tina (consists of 2 titles): 
One DVDShrinked ISO for the whole,
and one xvid file for each title.

Possibility do define files only for subset

Rooms

A room is a physical space to store objects.

Some rooms, e.g. CD cases, hold objects of a specific type, e.g. CDs and DVDs, and have a given capacity of such objects they can hold. This type is stored in the room.dt field (for dimension type). Note that dt='CD' means that the room can hold CDs, DVDs, HD-DVDs and Blu-ray discs, as they all have the same dimensions. The dt field refers to the physical dimensions only, not the storage capacity of the medium.

The capacity of a room (how many media it can hold) is stored in room.cap. If you want to put a given object in a room, the object.dt value should be equal to the room.dt value.

If the roomid information of all objects is accurate, Unibas can tell for a given room if there are still vacant places in it, and can suggest where to put new objects. The view v_room_load counts for every room the objects it holds (only the objects whose dt equals the room's dt). From this, the view v_nextroom finds out which rooms have vacant places, and suggests the first of these rooms to put the next object. E.g. if you have a DVD to put somewhere, call:

select min from v_nextroom where dt='CD';

This is how the CD/DVD burner tool creates its suggestion.

To quickly find an object in a room that holds many of them, it is useful to order them in the room by the descr or id field. Then you can just call the SQL statement: select id,descr from object where roomid=1117 order by id; and quickly check that every object is in its place.

Specifications for room table

id: ID
descr: description
dt: dimension type: tells what dimensions objects must have if they are to be stored in this room; for example, 'CD' means this rooms stores only CDs and objects of the same dimensions, i.e. DVDs, HD-DVDs, Blurays etc.
cap: capacity of this room, in objects of dimension type dt
objectid: physical object this room is in

Product

A product is a (maybe compound) object that is produced in large amounts and sold. That is, the description of a product is interesting to all people who have an object of it.

The product table stores only fields that make a product out of an object (or several objects). The properties that the object has (no matter whether it is mass-produced or not) are still stored in the object table.

So what if we know a product but not a concrete object? For such cases, e.g. freedb.org CDs, we have virtual objects, where object.virtual is true. For such an object it is an inconsistency error to specify a roomid or an owner table entry.

The product table is also usable for price comparison, to exactly identify a product.

There is no document info in product entry. Also, no artist info (except in descr, not computer-readable).

If documents are stored on products (e.g. CDs, DVDs), we use objects (real or virtual) and copies (cp records) to keep this information.

If several CDs are sold in a package product, this package gets an ID of its own (see specifications). The parts refer to the package with their packageid field, and usually have no EAN of their own (but not always, they may also additionally be sold separately).

A product entry of a single disc refers to the disc, not the package. The package around a single disc usually deserves no entry of its own, unless it contains something special, like a book(let) or additional discs.

All objects of the same product are supposed to have (implicitly or explicitly) the same content, at least the same main content. That means, an original video DVD may have some extras that may be lacking in shrinked copies.

Relations that describe products:

front cover image (relid 63814): left side is cover image document, right side is product
back cover image (relid 63815): left side is cover image document, right side is product
show (relid 53225): left side is image document (of the product itself), right side is product

Specifications for product table

Specifications.

id: ID
descr: description
companyid: ID of producer (company that really produces the product). If the product is distributed by a different company, that company's ID is stored in distributorid.
ean: EAN or UPC of the product
discid: Identifier that can be read directly from the product (e.g. CD/DVD), and may be unique or not; for audio CDs, this is the FreeDB discid; for video DVDs and other discs with filesystems, this is calculated as an md5sum over a listing of all directories, files and their lengths in the filesystem
ot: type
artno: Number given for the product by its producer or distributor
rt: recording type: NULL=no info, ''=empty, 'Audio'=audio, 'Video'=video (e.g. DVD, VCD), 'Data'=data (ISO9660)
distributorid: If Production Company and distributor differ, this contains the ID of the distribution company. For example, according to IMDB (http://www.imdb.com/title/tt0413267/companycredits), Shrek the Third was produced mainly by DreamWorks Animation and distributed (among others) by Paramount. Other companies and persons must be dealt with by rel entries.
volumelabel: for audio CDs, this is empty as audio CDs have no volume label (AFAIK); for video DVDs, this is the volume label as returned by dvdbackup or volname; this is not unique, for example, DVD series often have one volumelabel for the whole series. and there is a default label, "DVDVOLUME", that very many DVDs have.
releasedate: the date the product was first sold; in case only the year is known, assume middle of the year, i.e. July 1.
packageid: ID of a product this is part of, for example of a 2-DVD package. The single disc may nevertheless have an EAN of its own, but usually it hasn't. Selling a single disc may or may not make sense. For example, the three discs of Ritmo Latino, a collection of latin instrumental pop music by different composers, may well be sold separately. On the other hand, if a story is too long to fit on one disc and has to be separated into two, these two can not reasonably be sold separately.
isxn: product-type specific identifier such as ISBN or ISSN

Lend

Deprecated?

personid: person ID of borrower (a friend etc.)
companyid: company ID of borrower (e.g. a library)
fromdate
tilldate
lent: This boolean field is used to store that and object still is lent and needs to be returned soon (true). A value of false means the object has been returned already (at date tilldate).
objectid: ID of borrowed object. The owner of the object is stored in the owner table.

The owner table

personid: person ID of owner (a friend etc.)
companyid: company ID of owner (e.g. a library)
objectid: ID of object this company/person owns
mediaid: Number given by the owner. Corresponds to the object ID in Unibas. Every owner has their own numbering scheme.
location: Code to record where to put the object, e.g. in libraries. Roughly corresponds to the room ID in Unibas.

In case both fields personid and companyid are non-Null, it means: owner is this person as a representative of this company.

Documents and archiving

Documents in UniBas speak are packages of information, content, no matter what sort, for example texts, music, video, computer software, games, etc. The word derives from the "documents" of office suites, which may be text with additional pictures and other inclusions.

There is a reason to allow documents without copies of them: often there are several versions of a document and we need a canonical version of the document to refer to. For example, there could be several versions of a composition by a classical composer (MIDI, MP3, sheet music in PDF, etc.) and only for the canonical version, we store the composer. For the other versions, we derive the composer from the canonical version (the version that is not on the left side of a "version" relation).

Archiving functionality of Unibas includes:

storing metadata (titles, authors, composers, ...)
backing up (Unibas acts as a frontend to tools like dvdbackup)
refreshing backed up documents
re-creating copies in original format, if possible

We use the term "to archive" in the meaning: ensure that documents are available and can be found now and in the distant future. The task of finding documents needs the metadata mentioned above. Keeping the documents available and readable in the future implies re-copying them from time to time, as media have a limited life span. That also means: you can't simply put prerecorded CDs or DVDs in the shelf and expect them to be there and readable all the time. Producers usually do not want their CDs or DVDs to be copied, but for archiving it is necessary to copy them, even if it is illegal in some states.

Separating software and hardware

The common term "book" has two aspects:
soft: written work, e.g. a novel
hard: the data medium, i.e. the paper the novel is printed on
You can store the "soft" book, i.e. the contents, also as a file (e.g. as simple "plain vanilla" ASCII text file, like many Project Gutenberg Etexts) on floppy disks, CD-ROMs or hard-disks. The UniBas term document means only this "soft" aspect, with no regard of the storage medium.

For UniBas, a "document" is primarily a record in the document table. It contains data (title, type, etc.) to identify the novel, movie, concert, etc. As you see from this list, documents can (and have to) be further sub-divided, and that technically and with regard to the contents.

As documents are completely "spiritual", they can't exist without "materializing" them. The contents of documents are stored on media such as paper books, video and audio cassettes, CDs and other digital media. A medium is a physical object with the ability to store information. A copy (record of the cp table) links a document to a medium it is stored on.

For UniBas, analog media such as books, video and audio cassettes, and digital media without filesystems, such as audio CDs, are not so interesting. If the user needs some information stored on a paper book, UniBas can only tell him where the book is. With digitally stored information, UniBas can also fetch the document and present exactly the desired part of it to the user.

Examples of what the program should be able to handle

UniBas is supposed to handle:
Books (monographies)
Book series of constant size (e.g. the "Lord of the Rings" trilogy)
Contents of Music CDs: Songs, Groups, Symphonies and their parts, operas (of which parts, e.g. ouvertures and arias, often are isolated)
Books of songs and music notes; single songs (melodies and text)
Scanned books (and parts) in graphics form; OCR'd scanned documents (not yet proofread, with OCR mistakes); proofread versions
Single images, e.g. self-shot pictures; series of them
Video sequences (e.g. from travels); series of them
Document entries can be independent from existing copies. E.g. you know that a document exists, e.g. a movie or a song you heard people talk about, but you don't (yet) have a copy of it. Then you create a document record for it, with all you know about it. If you would like to have a copy of it, put it in your list of desired documents, and when you meet a friend who has a similar collection, you can automatically exchange documents. A document thus can be marked "desired" or "no longer needed". Whenever a chance appears to get a "desired" document (in radio, TV, in a library, from a friend, from the Internet, ...), the computer (i.e. the document acquisition robot docquir, pron. "docker") catches it. This applies not only to known single documents, but also to documents only given by their properties (e.g. "all Westerns with John Wayne", every new release of KDE, etc.). Similarly, documents can be marked as no longer needed. In general, they are stored on rewritable media, so when a medium with no-longer-needed contents is mounted, the computer deletes this contents and frees space (after asking, perhaps).

Documents without copies

Document entries can be independent from existing copies. E.g. you know that a document exists, e.g. a movie or a song you heard people talk about, but you don't (yet) have a copy of it. Then you create a document record for it, with all you know about it. If you would like to have a copy of it, put it in your list of desired documents, and when you meet a friend who has a similar collection, you can automatically exchange documents.

Also, it can be useful to divide a document into part documents, and have copies only of the parts. You then have documents for the whole and the parts, but only copies of the parts. Examples are MP3 versions of the tracks of an audio CD.

Document acquisition robot ("docquir")

A document that has no copy (yet) can be marked "desired". Whenever a chance appears to get a "desired" document (in radio, TV, in a library, from a friend, from the Internet, ...), the computer (i.e. the document acquisition robot docquir, pron. "docker") catches it. This applies not only to known single documents, but also to documents only given by their properties (e.g. "all Westerns with John Wayne", every new release of KDE, etc.).

Similarly, documents can be marked as no longer needed. In general, they are stored on rewritable media, so when a medium with no-longer-needed contents is mounted, the computer deletes this contents and frees space (after asking, perhaps). Deleting files can be useful if there is a new version available and the old version has no benefits but only more bugs. (Keeping old versions of computer software can make sense, however, if newer versions require newer hardware and there is existing older hardware that works only with old software).

Mimetypes in Unibas

Like other systems, Unibas uses MIME types to identify the types of documents and to associate documents with applications.

Unibas uses the mimetype table which is filled from infosources that provide mimetypes and filename extensions, most notably the official IANA registration site.

But Unibas also knows a few analog document types and defines pseudo-mimetypes for them. These are uppercase two-letter codes that cannot occur as real mimetypes. The list currently (Dec. 2009) defined is:

SM: Sheet Music: Music in any music notation with little or no text
TO: Text without illustrations (text only), books with mostly text
MP: Moving pictures, on magnetic tape or film or any other suitable medium
CS: computer software, digital but of unknown type (should be avoided)
SP: still picture, e.g. photo on paper, also books with many illustrations
SO: sound only, recorded sound, mostly music on MC or vinyl

To list all pseudo-mimetypes, the following SQL statement can be used:

SELECT * FROM mimetype where type~'^[A-Z][A-Z]$';

All other entries in the mimetype table are valid mimetypes, mostly registered with IANA, but in any case following the scheme type/subtype. Some of them may also be not archivable. Only archivable documents can be made available for playback (can be "shown"). Unibas functions refer to mimetype.archivableCriterium for the exact definition of archivable. For non-archivable documents, you need archivable copies.

Specifications for mimetype table.

id: ID
type: type/subtype as defined in RFC 2045 (subset)
descr: Human-readable informal description
program: Program associated with type, path to executable
ext: comma-separated list of filename extensions associated with this type, in most cases one single extension
support: degree of direct support by Unibas (archiving, extracting metadata)
name: Short, human-readable name
shortname: Very short (usually max. 6 characters) name useful as column header

Mimetype infosources

Infosources that provide mimetypes and filename extensions:

filesuffix.com: About 3000 formats with filename extensions and mimetypes, but over 2000 of them of unknown mimetype, disguised as "application/octet-stream"
filext.com: Windows-centric, it only tries to distribute their (shareware?) Windows program RegistryBooster.exe; very few mimetype data
mark0.net/soft-trid-deflist.html: site of a software like the Unix file tool, with a list of extensions and file type names, but no mimetypes
file-extensions.org: not very informative, mostly ads
dotwhat.net: extensins, names and descriptions, but no mimetypes
fileinfo.com: extensins, names and descriptions, but no mimetypes
www.iana.org/assignments/media-types/: the official IANA registration site, contains mimetypes, links to descriptions, and contact people; extensions not directly but usually in the linked descriptions which are mostly RFCs
wikipedia.org: the Wikipedia also has articles about mimetypes and about single formats, and in sidewindows therein it also notes filename extensions and mimetypes (Internet media types)
www.w3schools.com/media/media_mimeref.asp
docs.python.org/library/mimetypes.html

http://www.iana.org/assignments/media-types/ seems to be the official page to look for mimetypes. It links to pages about the types, which contain a list of subtypes: http://www.iana.org/assignments/media-types/application/

The entries therein contain the full mimetype name, maybe with a link to the format description, and a reference to a person or a document, usually an RFC.

TODO: Unibas extracts all this information, stores the mimetypes in mimetype table, the linked documents as documents with appropriate metadata, the persons as persons with metadata and email (for example the link to:

Paul Lindner, <lindner&mudhoney.micro.umn.edu> , June 1993.

from the

wordperfect5.1 [Lindner]

line gets stored as: email: lindner@mudhoney.micro.umn.edu remark: registered June 1993 the mimetype application/wordperfect5.1 and maybe the rel "registered".) The documents get archived and are "about" (rel) the corresponding mimetype.

RFC documents follow a quite strict format which makes them easily computer-readable. Metadata that can be extracted includes:

Names of persons and companies involved
RFC number
rel to RFC(s) it obsoletes or updates
category
date of publication
title
abstract

Filename extensions are often associated with mimetypes, but the official IANA page does not mention them. Filename extensions are OS specific (for example, MS-DOS and early MS-Windows versions used the extension HTM for HTML, Unix uses html) and some OSs don't even know the notion of filename extension.

http://www.filesuffix.com/ is a site that lists extensions. As such, it is an infosource: filesuffix.com. See its readme. Given an extension, e.g. py, the corresponding page can easily be fetched and its content extracted: http://www.filesuffix.com/extension/py.html Content is:

Type
Category
Mime type
Description

Extension lists can be obtained from: http://whatis.techtarget.com/file-extension-list-A/0,289933,sid9,00.html from http://whatis.techtarget.com/, and others. These provide iterable lists that can be linked with filesuffix.com to get the appropriate mimetypes.

Some other lists of filename lists:

http://whatis.techtarget.com/file-extension-list-A/0,289933,sid9,00.html
http://www.sharpened.net/helpcenter/file_extensions.php?A
http://www.saugus.net/Computer/Extensions/
http://filext.com/
http://en.wikipedia.org/wiki/List_of_file_formats_(alphabetical)
http://www.file-extensions.org/
http://www.gaijin.at/en/lstext.php

None seems to be complete, we need to combine them. Each is a infosource. And of course, many extensions have double meanings.

Copy

The cp table represents the relationship between a document and its medium (CD, DVD, harddisk, ...). There are no IDs for copies. Especially useful for computer-readable media is the file field that contains the file name.

In Unibas, generally one document is one file. Directories are not dealt with as such. This makes many things easier: calculating md5 sums, traversing directories of documents (where in the path does the directory structure end and the document begin?). Some assistance is offered (TODO) to transparently pack directories into files (zip, tar etc.). Columns: objectid documentid file: path in the medium; need not be on root directory remark: rarely used, not automated date date when this particular copy on this medium was created. It allows to tell when a document needs to be re-copied to avoid data loss due to medium aging.

Specifications for cp table

objectid: object ID of medium
documentid: document ID
file: relative path, no leading or trailing slash
remark: non-computer-readable remark
date: date of copying to this medium

For MCs and VCs:

if float < 1, then start time in minutes/60/24;
otherwise, [(A|B| )] [hh:]mm:ss
where (A|B) is optional side
hh: is optional hour (int)
mm is minutes 
ss is seconds.

So in general, the field cannot be used directly to sort, but an entry like "1:23" must be reformatted as " 00:01:23" or "A 2:5" as "A 00:02:05".

This is done by the SQL function normalizeMusiCass(object id, minutes) and is important for digitizing analog media.

Tracks on audio CDs are named as follows: 'track01.cdda', 'track02.cdda', etc. (NOT '/Track01.cda' etc., that is obsolete). Note that CD tracks cannot be accessed using normal filesystem operations. DVD have a directory structure with an often CSS-protected content. Therefore audio CDs and video DVDs cannot be treated as normal Unibas media.

The Unibas archive

This section explains how the Unibas archive is organized, i.e. the harddisk space Unibas uses to keep documents for direct access an the removable media under Unibas' control.

Filenames in the cp table are given relative. Files on a medium need not be on the toplevel directory.

Files are stored on media (CDs, DVDs, HDs, etc.) and these media can be mounted in the directory tree. More than one medium can be mounted at the same time, but one of them, the "active" medium, is special in that documents to be copied "into the archive" will be copied on this active medium. Its value is stored in the ID variable activeMedium.

There is always an active medium. There may or may not be additional mounted media.

The directory (path) to Unibas' mounted media is stored in the variable archivePath. It is complete with the beginning and ending slash, e.g. "/unibas_archive/". In general, a path beginning with a slash is absolute, one not beginning with a slash is relative. archivePath is absolute. As another general rule, a path ending in a slash is a directory, one not ending in a slash is a file. archivePath is a directory.

The mount point is constructed from a pattern and the object ID of the medium, e.g. "oid%1/". %1 is replaced by the object ID. This pattern is stored in the variable mediumPattern. Note that mediumPattern does not begin with a slash, so it is relative (to archivePath), but it ends in a slash, as it is a directory.

That is, when Unibas needs to access a given document, it first looks on which objects (media) there are copies of this document. From the cp table it gets these object IDs and the filenames (which may be relative paths on the medium). Unibas constructs the complete path this way:

archivePath + mediumPattern (with object ID) + (relative) filename path (cp.file)

. Filenames/relative paths do neither begin nor end with a slash. If a file is found this way, Unibas (and its user) is happy, the desired document is accessible immediately. If not, the user (or someone else) needs to make it accessible, i.e. put the medium in its drive and mount it.

For the active medium, the object ID in mediumPattern is activeMedium. For convenience, the complete path archivePath + mediumPattern with activeMedium is stored in te variable activePath.

Note that copies of the document may be on several media. A decision must be made which one to take. If several documents are on the wishlist at the same time, it may be a nontrivial task to decide from which media to take them, in order to minimize the number of media to mount. (One could also associate an access cost to a medium, and try to minimize the total cost to access the documents on the wishlist.)

The mounted media are listed in the table archivemedia:

  Column  |  Type   | Modifiers
----------+---------+-----------
 objectid | integer |
 mount    | text    |
 writable | boolean |
 active   | boolean |

Objectid is the medium object's ID. mount is archivePath + mediumPattern with this objectid (so this is redundant, but convenient). writable is true if the medium can be written directly through OS functions (false for CD-ROMs, CD-Rs, CD-RWs etc., true for DVD-RAMs if they are mounted read-write). Active is true for the active medium and only for it.

Archiving means copying. If taken seriously, archiving means preserving the contents (documents) of media for possibly longer than the expected physical life of the media. That means, the documents must be copied to fresh media in time before the lifetime of the old media end. Otherwise the documents are lost with the media some day.

To be sure, one should make several copies on different media, maybe stored at distant places to avoid losing the documents in a fire or similar incident.

Some documents may have a limited life span, i.e. time during which they are of interest. After that, they can be "forgotten". Examples are software releases: if you know that a new release comes out twice a year or more, you can assign the software a life span of less than one year. (However, it may be useful to keep old software, e.g. in order to install it on old hardware. New software often doesn't take care of the limitations of old hardware.) Unibas could (one day, TODO) support "forgetting" documents. A new field "expiration date" then holds the date from which a document could be deleted. When this day arrives, the user could be reminded of it. He then can decide to really delete it, or prolong the lifespan. If the document is to be deleted, it will be marked as such. Then all copies have to be deleted. Of course, this is only possible if they are on multiply writable media. Then the storage space can be recovered. If they are on write-once media, deleting is not possible, but if a write-once medium contains only documents that are marked to be deleted, then the medium can be thrown as a whole and physical storage space is recovered.

For a multi-user installation such as Unibas.org it will be difficult to get all users to agree on deleting a document. Therefore it will rarely happen.

For most documents, like literary works, music and films, there is no reason to delete them some day. On the contrary, they may even become more valuable with time passing. As stated above, archiving them finally requires copying them to fresh media, and if there is copy protection on the original media, then it has to be overcome, even if it is illegal in some countries.

Replacing an archived document using replaceDocument

Normally, changing an already archived document is no good idea. The md5sum must be recalculated, and there may be multiple copies which all need to be replaced, if possible. The replaceDocument function (TODO: not yet implemented) tries to do its best.

Given the document ID and a file path of the replacement file. If there is no copy: warning. If all copies are on an accessible, writable media (e.g. the active medium), then every copy is replaced by the new file, and the new md5sum is updated. In this case and if everything works well, the function works quietly and can be used in automated tasks. For example, CDs can be ripped and MP3'd without FreeDB access. The missing album and track titles etc. can later be added to the DB, and the id3tags are then updated also. This means changing the already archived files, and for this the replaceDocument function can be used.

All other cases are more complicated. If there are copies on external Unibas sites or on read-only media, no chance to replace without manual intervention or risk of inconsistency. All the program can propose is: Create a new document, copy the data from the old on the new, except the md5sum of course, and make the new one a copy of the old one or vice versa.

The filelist table

The filelist table is a DB table mirror of the list of files in storage media. As the DB cannot access files directly, this table is needed when it comes to comparing the contents of the cp table with the real contents of the storage media.

The filelist table's main purpose is checking the accuracy of the data in the cp and document tables. This essentially means comparing cp and document contents with the output of the ls shell statement. But the DB knows nothing about the ls shell statement, so this statement's output is essentially translated into DBish and the result is the filelist table.

With this table and some SQL wizardry on its contents combined with the cp and document tables, the DB can precisely find out if there are problems, where they are and how they can be solved.

The filelist table has the following fields:

objectid: ID of media object
file: file, or local path inside media
size: file size in bytes
isodate: file modification date/time in ISO format: yyyy-mm-dd HH:MM:SS
hash: this field contains unique hash values of the file's contents calculated as md5sum, but the algorithm might be changed some time; corresponds to document.hash

The file field corresponds in name and content to that of the cp relation. The size field corresponds to that of the document relation.

The isodate field is used to detect when a file has changed. Normally, a changed file's modification date/time is then adapted, so it differs from the date/time stored in the isodate field.

The hash field corresponds to that of the document relation. It is the only reliable indicator that a file contains what the document DB entry says. Its drawback, however, is that its calculation takes a lot of time. Hence this calculation is avoided if possible. If a file's name, size, and isodate are unchanged (i.e. the filesystem and the filelist table show the same file name, size, and isodate), then the file can usually be assumed to be unchanged and the hash value need not be recalculated.

Modules that deal with files, i.e. mainly the document, utilities, and filecheck modules, adapt filelist along with cp and document tables. In general, whenever there is a known change in files on media, this change is noted in filelist as well.

Documents on multiple media

A data medium (video tape cassette, music cassette, book, etc.) can contain multiple documents, and on the other hand several copies of a document can be on different media. The whole and the part Clearly we need sub-documents (parts, e.g. opera ouvertures), and hence a part - whole relation, realized with rel.

Copy of parent on object => copy of parts also there
Copy of part on object => copy of parent not necessarily there

Some problems arise if a document is spread over multiple media. Usually there exists an order on the parts of a document that helps to reconstruct it.

In cp table there is described which part of which document is stored on which medium. Now the following case may occur:

a document doesn't fit entirely on (the rest of) a cassette tape, let's say 20 min. fit on it, the remaining 10 min. have to be stored on another medium.

If you keep another copy of the same document, also divided into two parts, you can only guess from the length of the total and the parts which part belongs to which copy. This problem can be solved by attaching numbers not only to documents and objects (i.e., media), but also to the copies themselves. But this seems necessary only if the described case occurs frequently.

In general, we won't deal with this kind of situation for the following reasons: If a whole is logically divided into parts, then we have the structure for this: the part relation. If a document is divided only because it does not fit on a medium, then this is an unlikely and obsolete situation. We deal with media of 9 GB (DVDs), 25 GB (HD-DVD), 200 GB (portable HDs) and above. Music (MP3) takes very little space compared with that, and DVD films can be compressed into 2-3 GB. So even the biggest documents, full-length films, fit onto portable media.

A document that is not stored as a whole, but only in parts on different media, should not get a cp entry of its own. Instead, for each part there is another document entry, related by a rel entry to the whole document, and only these parts have cp entries. The titles of these part documents or the cp remarks tell which parts give the complete document, and in which order, e.g. "#1/3;next4711" where 4711 is the objectid where the next part of the document is.

In the case of part documents above, all central information about the document is related to the main document entry (that without cp entry), so there is usually no author, composer etc. info directly about the parts, but only about the complete document. From there, it can be derived for the parts.

It's a different case if you have e.g. a book, that forms a document of its own, but consists of several part documents of different authors, e.g. an anthology. In this case, the book as a whole has related (e.g. editor) information, as well as the parts.

Generally: Special cases can be handled by text fields that can be parsed; no extra tables will be created for that. (Thought of Field "parent" in document; if >0, ID of parent document; better:) Table docpart, fields subdoc, parentdoc, order; so, the necessary fields parentdoc and order don't enhance the document table, and if needed, there is the possibility to model the fact that a document is part of several others.

Btw, docparts (sections) might be useful with the thesaurus: whole doc is higher in thesaurus, section lower. What holds for complete docs, in general also holds for parts, e.g. authors, etc.

Field "part" binary: true if only part, false if usually viewed as complete; E.g. "Winnetou I", although part of the "Winnetou" trilogy, is a complete work and thus has part=false; the "Allegro" of a symphony has part=true.

The copy table has a remark field where in fixed form info about physically divided documents can be stored. Which length is to be accorded to documents? Literature classics are printed on books of different sizes, so their length in pages is not a property of the document but of the medium. You may only give them a length in characters. On the other hand, documents like atlases have a specified number of pages. So the means of telling the length of a document seems to depend not only on the technical, but also on the contents type. - Hence the needs to take into account text only (TO) documents, too, because here you will be able to indicate the length in characters, as well as referencing with the accuracy of characters.

With MP and SO there is no problem: you tell in document the length in minutes (and seconds). To be more precise: for MP and SO, i.e. videos, music etc. on musicassettes etc., length is a float, unit is days (for compatibility and because data are already in this format). If necessary, they can be converted in seconds, which is the preferred format scientifically. To convert into a readable format:

select substr(title,1,40), interval(length::text || ' day') from v_doc where ot='MC';

In FILE you can tell the filename with path, but without a DOS drive letter or the like. This depends on the drive you read the medium from. Tell it from the root of the medium. Of course, to access a file this is not sufficient. Under DOS, you additionally need a drive letter, under Unix a mounting point. Both are subsumed as Prefix in the Drive prefix table. Usually under DOS, a floppy drive is "A:", but it might as well be "B:", and for ZIP, JAZ, MO, CD-ROM, PD, DVD and all the other possible drive types you can't tell in advance what the prefix is. If the document is spread over several files, you can tell the subdirectory instead of filename. What about .ZIP files you know only the compressed size of? Store compressed size. Uncompressed version is a document of its own with its own size.

Technical types and Contents types

As UniBas handles all types of documents, these types must be categorized so the user can tell that he is looking for a video or a book or whatever type he wants.

For digital documents, there have been MIME types before and independent from UniBas. Unibas uses its own or KDE's mimetype handling, see http://developer.kde.org/documentation/library/kdeqt/kde3arch/mime.html For analog documents, or non-computer readable documents such as videos on videotapes, Technical Types (TT) have been defined for UniBas.

As TTs you can distinguish:
(SP) still pictures. Books with pictures and text, i.e. books decorated with some pictures and books containing mainly pictures, maps etc. Also for photographs, slides. For computer-stored pictures, MIME types should be preferred.
(TO) text only. Documents that can be stored without loss of information as a sequence of characters (e.g. literature classics on paper books). No pictures etc. allowed. Typically novels etc. in (paper) books. Also etexts, but for those MIME types should be used instead.
(MP) moving pictures (with sound); preferred for video documents
(SO) sound only; preferred for (documents on) music cassettes, CDs, LPs etc.
(CS) computer software, i.e. programs that run on computers

For computer-readable documents, MIME types are the best way to categorize document types. See man mime, KDE's and Gnome's docu about MIME types, and others.

Distinguish TTs from Contents Types such as novels, short stories etc. (genre), which is not covered by MIME types.

With regard to the contents you have:
literary works (written or read aloud on a cassette tape),
concerts (as a video clip, sound only, or score),
movies (on video or as screenplay),
and many others.

Here are some contypes that are or should be used:
sound book
children's sound book
nursery rhyme

Documents are stored on data media which are physical objects.

Examples

This is about how some more complicated cases should be handled. In general, for multimedia documents, the document itself is the digitized representation of it containing all information of the original, plus some meta-information like title etc.; this is where all relationships should point at (e.g. author, composer, actor, director, ...).

For videos, if there is only a non-computer-readable form (on videotape), the relationships must point there, but when it is digitized, the relationships go over to the digitized version, and a relationship to the analog version is established. For DVDs, the ideal format surely is the DVD (contents) itself, maybe some copy-protection removed, in one file that can be freely viewed and copied like other files. For other videos, and also for DVDs as long as files of that size can't be handled, the preferrable format is something like OpenDivX.

For music, if there is only a performed version available, it's this, e.g. in MP3 or OGG format. But the preferred format is that of a music note editor like Capella. It shows the music notes and can as well be played back through a good wavetable sound card. This holds for classical music, piano and guitar literature, songs etc. The ideal note editor can:
be used for editing via computer and MIDI keyboard
play back and show the actually played notes simultaneously
jump to a given place
convert to and from MIDI
convert to and from a standard music-note format, if there is
produce professional quality printed notes
and is free, of course, and available for free OSs.

Movies

The document itself is the 
E.g. Western:
English and foreign titles form docgroup
USA 1951: 1951 is creation date, USA goes in remark
length: float field, interpretation type-dependent, for videos: seconds,
for files: bytes
Contents type: western (or its id in notion table)

Identity of documents

The identity definition for digital documents is very strict: Two file documents are identical iif they are bytewise identical. The criteria for two documents to be identical (i.e., to share the same Document ID) should be rather severe. Identical are:
The contents of two books with the same ISBN number.
The contents of two copies of the same music CD or shareware CD-ROM.
Two corresponding part documents from books, CDs, CD-ROMs etc. as above.
Not identical are:
A document on a paper book and its scanned version.
A scanned book as above and its OCR'd text.
Two different versions of the same program, e.g. Windows 3.1 and 3.11.
A text and its translation into a different language.
A symphony on paper and its performance on CD-ROM.
Note that the document pairs defined here as not identical all are in some relation. They may be considered equivalent under certain aspects. They belong to the same . In a relational database, an equivalence class for documents is an entity table with an ID field and some other fields that define properties common to all documents in this class. A link relation links a document and its equivalence class. However, after considering this and the doc2doc model, I found them too unflexible and favorize rel now.

Document generation

Integrated Documents are documents generated from the Thesaurus/Quotation database. As they all come from the same database, their hyperlinks can be absolutely precise, not only inside a document, but also across documents. They also can be generated on the fly. For details see the Generated Documents section.

Documents (non-integrated) should always be regarded as raw material for integrated documentation: data (mostly text) collected in the database, formatted and prepared to generate integrated documents. Integrated documents have hyperlinks not only inside themselves, but also to other related integrated documents (and to the non-integrated documents they are generated from). Furthermore, the dictionary only explains words and doesn't describe the notions in detail; rather, it provides links to nodes and thus to integrated documents and quotations.

We must categorize photos etc. to easily find them again. However, this seems to be a matter of the Thesaurus rather than the document table. Photos and video sequences should be viewed under this aspect of being raw material for integrated documents.

Extracting document metadata from its file

A newly acquired document (from the Internet or elsewhere) must first get a database entry (a record in the document table) before it can be referred to in the database.

This entry needs some data about the document. The user can enter the data manually (and must do so in case of non-digital documents). However, for digital documents there is the possibility to extract the data automatically from the document itself.

As a very simple method, the bash tool checkmedium provides a rudimentary way to extract document data for a list of files. Just copy the files in the archive and run checkmedium. This takes the filename as document title and provides some reasonable defaults for most other fields. For better results, Unibas has to look deeper into the file.

This task can be easy or difficult, or even impossible, depending on the type of the digital document. In case of HTML documents, for example, it's trivial: the title is given between <title> and </title>. For things like .wav files, it is usually impossible without massive Artificial Intelligence aid, which is far beyond the scope of this little program. In such a case, only the filename can be offered to the user as title.

In the area between these two extremes lie digitized human-readable texts where the meanings of the words can give hints, see the text analysis section. Digital documents usually are in one of a range of relatively fixed formats. This makes it easy to find the needed data.

The recognizer already handles (as of 2001-05-19) many formats correctly. What remains to be done is marked with "todo:".

The recognizer handles not only files, but also directories, and packed data (which are unpacked into a temporary directory and treated as such). During the process, some (internal, temporary) data are global, others are local to the actually unpacked file. Global: todo: original filename (body and extension); central data like title, summary Local: p

Some data (fields in the document or other tables) are single-valued, like the file length; others (text fields) are incremental, like the summary: every useful (todo: filtering) information can be put there, it's much easier for the user to delete unnecessary information than to have to look it up or type it himself. The title requires some more care: the filename as a fallback should only be written there if no qualified title from the contents could be found. todo: file-id.diz: remove chains of identical, non-alphanumeric chars bash - Qt Packer formats
zip
gz
todo:tgz
todo:tar
todo:rar
todo:bz
todo:deb
rpm
todo:lsm
todo:Windows cabinet
Text formats Many of these formats can be converted into plain ASCII text, and then analyzed. Some mark the contents, as HTML, SGML, Latex. For these, it is usually easier to get the data from the marked-up original. Others mark the appearance, like RTF. Markup of these formats is like garbage around the text for the recognizer; it is better removed by a converter.
HTML
SGML, e.g. Docbook; todo:extract author etc.
todo:Latex
todo:DVI, dvi2tty
todo:RTF
troff: title can be extracted from "name" section
todo:infofiles and series of infofiles, by START_INFO_DIR_ENTRY
todo:Project Gutenberg Etext
todo:PDF, to ASCII by pdftotext or directly; depending on PDF version?
todo:Postscript, to ASCII by ps2ascii or directly; depending on version?
todo:MS Windows Help
todo:RFCs

Info quality ranking todo:Best-quality info comes first, so user usually can delete all the rest. Realized by a string list and a ranking list. Text analysis using the words' meanings todo:By Unibas' dictionary, given names and in many cases even family names can be recognized as such, and thus names of authors, editors etc. are found. Helpful hints also come from words such as "by", "copyright", and mail addresses that follow names and can be recognized by their form. Extract related entities Related entities, like persons as authors, composers etc., don't have fields in the document table but in their own tables. todo:If an author of a document is found out, a new record is created and filled with his data, and a relationship to the document is established. In the document form, the user can edit the document and its relationships and easily gets to the related entities. There is no form specially to edit newly acquired documents.

Specifications for document table

A field named rc (for number of copies) could be used to indicate the importance of a document. The system tries to maintain rc copies of the document on different media, not counting the copy on the active medium. If a document has at least rc copies (on reliable, different media), then the copy on the active medium can be deleted. The default value for rc is 1; for imortant documents it should be at least 2.

Some documents become obsolete over time. (This mostly applies to private use of Unibas, not so for the central Unibas site.) For example documents about financial transactions should be kept for some years, then they can be deleted. Or if you know that new versions of a software are released yearly, then you can assume that the current release will be obsolete in a year plus a period you keep the old software although the new release is available. So when creating a document entry in the DB, you can often foresee that the document will be obsolete after a given date. This date can be entered in the document.expires field.

id: ID
hash: formerly called md5sum, this field contains unique hash values of the document's contents; still calculated as md5sum, but the algorithm might be changed some time
title: Title
summary: Brief description, in English or l1
length: duration of a song or video in hours, minutes and seconds; length of a book in pages
size: size in bytes
contype: text field indicating, in English, type of document, e.g. "nursery rhyme". Maybe later replaced by a notion ID.
date: creation/release date
l1: primary language, like "de" for German
l2
mimetypeid: refers to mimetype table
packerid: will probably be dropped
licenseid: Refers to license table
expires: Date when document is likely to become obsolete
prototypic: True if this document is not given as a copy on some medium, but an original, unpublished prototype, of which more concrete versions may exist. If so, then many properties of the different versions are common and should be registered with the prototype, not the individual versions. For example, of a movie there may be many versions synchronized in different languages. The (original) actors, director etc., however, of these versions are common and are registered in the prototype document. That (original) document surely is stored on some medium (object), too, but only the producer or copyright holder has access to it, so it does not make sense to store it in Unibas. If a document is prototypic, then it is no error that it has no copy record in Unibas.
rc: required copies: copy in active medium may not be deleted unless there are at least this number of external copies. Strictly speaking, this is not a property of the document but of a relation between user and document; but as Unibas is currently not really multiuser, this remains with the document table. It is used e.g. in the v_archive_count view.

rc can also be used to mark a document as obsolete. Proposal:
0: delete document, but keep record and linked metadata
-1: delete document and linked metadata, but keep record
-2: delete document, linked metadata and record

Copyright and licensing information

Important for what you can legally do with a document is its copyright and licensing information. The copyright holder information can, if necessary, be held in a relation record. If missing, the author can be assumed to be the copyright holder.

The licence category is one of: "free", "PD", "share", "dl", "book", "com", "cen". It is returned by the function:

QString getLicenseCategory(QString s)

Note that exact censorship information depends on many factors: server's country, client's nationality, country client is actually living in, ...?

If necessary, table "censorship" tells which document is censored in which country (given by countryid). countryid may have a special value meaning "all countries".

An entry in this table means: This document is forbidden in this country. If the server is located in that country, the document shouldn't even appear on the list publicly displayed.

If the server is in a country where the document is not censored, but the interested person is from that country, the download should be rejected and a message displayed saying "Sorry, this document is censored in your state.".

The country of the user is determined by asking him, or by default from data of the request (top-level domain, etc.).

Packerid

Often, there is a defined MIME type, but additionally, the document is packed. Example: Project Gutenberg's zipped text files. One could define a MIME type for zipped text, but in consequence this means defining an extra MIME type for each combination of packer format and (unpacked) MIME type. Rather, I prefer to define a list of known packer formats. Note that tar distinguishes between just putting together a number of files into one file, and additionally compressing them. However, this distinction between compressed and non-compressed packing is rare. Most modern packers have compression included. Hence, it is ok to have a separate non-compressed format (.tar) and a compressed format (.tgz) for tar. Packer formats are (given by their usual extensions): tar, tgz, zip, rar, bzip, bzip2, arj, lha, rar, uc2, ace, gz, zoo

rpm and deb are more than packers.

Abstraction types of documents

Sometimes it is useful to define a document just to serve as a container for other documents. For example, a trilogy in literature usually is no document of its own, but consists of three documents. Or a work of classical music, such as a symphony, on a CD, consists of several tracks. A "container document" here is useful as a hook for information like director, orchestra, soloist, composer etc.

Such a container work is prototypic, an audio work is sound only (SO). An MP3 document (e.g. MP3WRAP) can additionally exist as its version. This way the container work, as it is prototypic, it can be official, public, and has no bias toward a specific format (some people may prefer OGG or whatever). In a proper data structure, the container work is needed to refer to by role records. Playback of a container work can be done as playback of its archivable version, or as playback of archivable versions of its parts.

A program that check for consistency could report "normal" documents without copy record as an error or warning. For container documents it won't issue such a warning. But a container document without part documents could trigger a warning.

Suggested (opt.) field name: atype (for "abstraction type"); data type: char(1); values: 'n' for normal, 'c' for container, 'a' for abstract. Conditions:

'n': document without cp entry triggers warning and a proposal to delete (check relations before!);
'c': document without parts triggers warning;
'a': document without relations or with cp entries triggers warning.
'v': document on non-virtual objects triggers warning and a proposal to either turn document normal or object virtual.

Authors, actors and other artists

While it is always possible to store information about the relation between artists and dcuments in the rel table, this type of metadata is so common that it got an extra table, the role table.

CREATE TABLE role (
    id SERIAL PRIMARY KEY,
    documentid INTEGER NOT NULL REFERENCES document (id),
    ord smallint, 
    personid INTEGER REFERENCES person (id),
    companyid INTEGER REFERENCES company (id),
    characterid INTEGER REFERENCES character (id),
    languageid INTEGER REFERENCES language (id),
    notionid INTEGER NOT NULL REFERENCES notion (id)
);

This table is the usual place to store information about:

actors and companies/labels in movies
musicians, bands, orchestras in a music pieces

The first columns are straightforward:

documentid: The ID of the document. If there exists a prototypic record, then this usually is the ID of that prototyic record and all other versions "inherit" their data from it. The documentid must be provided, a role record without it makes no sense.
ord: If all contributors to a document should be listed, this field could be used as the field to order them. Or they could be ordered first by notion, then by ord. Useful especially for documents with lots of contributors, e.g. films with many actors. Suggest using the order of credits at the beginning or end of a film.
personid: The ID of an actor, musician or other person. In combination with company, a bandleader/conductor/soloist
companyid: The ID of a label, band, orchestra etc., possibly in combination with their leader/conductor. At least one of personid/companyid must be provided.
characterid: The ID of a character, for example Fiona in the movie Shrek. Note that this table is not intended to store data about characters in (paper) books. Use the rel table instead.
languageid: The ID of the language the actor is speaking. The ID of the lyrics language in case of singers. Empty (NULL) for a pure instrumental role (e.g. pianists).

The meaning of last column, notion, is not so straightforward and allows for some flexibility (and confusion). It is the ID of a notion (a record of notion table) that best tells what sort of role the artist plays in this document. For actors in a movie, this is the notionid of "actor". For musicians in a piece, this is the notionid of the instrument, not the instrument player, i.e. "guitar", not "guitarist", or "piano", not "pianist". There are two reasons for that:

It is the way used on CD covers etc. to write the roles.
While "guitarist" and "pianist" are common words, things ar more complicated and ugly with exotic, historic or less common instruments: "pianoforte" - "pianofortist"?

Other roles in movies are denoted by the following notions: producer: chief of staff, executive producer line producer: manages the budget and/or day to day running of a film production cinematographer: person photographing with a motion picture camera art director: oversees the craftspeople and artists who build the sets Set Designer: person responsible for creation of film or scenery

This is only a small part of the roles a person can play in the production of a movie. These notions are listed in the IMDB glossary. Detailed descriptions can be taken from the Wikipedia.

Specifications for CD-Audio and music in general

This section deals with how to represent music in the DB. It talks about the distinction between sheet music and performed music, lists formats for sheet music and performed (recorded) music, and mentions the rel relations between documents in these formats, and persons/groups (composers, artists, orchestras...).

As CD-Audio disks don't have a real filesystem, they are dealt with mainly like MCs and records. You find here specifications how tracks/songs are represented in the DB.

Audio CDs, MCs, LPs and their tracks

Audio albums are mostly CDs, prerecorded MCs and LPs:

object.ot='CD' or 'MC' or 'LP'

For originals, object.orig=true; for copied CDs (CD-R[W]), orig=false but still object.productid can be the same as for the original.

product.rt='Audio'

cp.file='track01.cdda' for CDs; for MCs and LPs, cp.file is in the format: 'A 00:01:00' resp. 'B 00:01:00'.

document.mimetypeid=113 ('SO') licenseid=4: Fair use.

To keep (preserve) a non-archivable document, an archivable copy must be made. This is in an open standard format (e.g. MP3, video/mpeg) and a "version" of the non-archivable document. The archivable copy is on the left side, the non-archivable original on the right side of a rel entry. Hence we have the following chain:

product - object - cp (copy) - (orig.) document - version rel - (archivable) document.

And that archivable document is linked with another copy entry to a Unibas medium, but that is not our topic here.

For MP3 files:

document.mimetypeid=57 ('audio/mpeg') licenseid=4

Original and copy are linked by the "version" rel.

rel.ent1=rel.ent2=5 (document)

rel.id1 is MP3 document, id2 is CD/MC/LP track document

rel.relid=19: version

So here each MP3 track "is version of" corresponding CD track.

The relation to the sheet music composition remains to be defined. That can be a PDF or, better, in a computer-readable note editor format such as MusicXML, that can be printed, played back via MIDI, and edited by note editors.

Artist etc. metadata refers to CD original, not copy (only indirectly). For details about these metadata see Specifications for document table.

Consistency checks should verify: A version of a document is always another document, so if rel.relid=19 (version) and ent1=5 (document) then ent2 must be 5, too, and vice versa: if rel.relid=19 and ent2=5 then ent1 must be 5, too. Also, if rel.relid=19 then id1 and id2 must be different. (A document can't be a version of itself.) Of course, the documents must exist (but copies of them needn't exist).

A work (esp. in classical music) can contain several movements (parts, notionid 44153). For example, Beethoven's third symphony (op. 55, "Eroica") consists of four movements (parts):

Allegro con brio
Marcia funebre. Adagio assai
Scherzo-Allegro vivace
Finale-Allegro molto

Beethoven person id: 2877 Schumann 2882

Relation "version" is used for e.g. documents that are versions of one another. The software should be tolerant as to which document is on which side of the relation. In general, however, the right side is the canonical, original version, that is the most generally usable one. For music compositions (sheet music), this is a format that represents everything the composer wanted to say, and can also be output as MIDI or played back directly on the computer. However, this is usually not available, so the second-best option is the CD track. MusicXML seems to be a candidate for such a format.

or not "version" but "performance"?

That is, for music, the canonical format is this MusicXML (or a better) format. The original (handwritten or printed) sheet music is only a version of this, because it is not fully computer-readable (can't be played back, can't be edited, etc.). Performed music is also a version of it. While the most canonical version is the audio CD version (tracks), the most practical version is the archivable MP3 version. One could argue that the patent-free Ogg is better, but there are free MP3 generators and players available, and MP3 is widely supported in hardware.

For editing: No hierarchical view, but simple table view; several entries can be selected at the same time, and are given container document or artist or whatever information. Part - whole relation doesn't go via "version", but directly. I.e., if there are 4 tracks on a CD for the parts of Beethoven's Eroica, then they can be selected and be made parts of a container document, that has no copy, and orchestra, director etc. is associated with this, and composer as well. Composer is also associated with the composition as such (MusicXML or whatever), so that's redundant, but doesn't matter. Often the "composition as such" is not available, so we need this container document. The "composition as such" is a good candidate for an abstract document.

Conditions for consistency

Given an album (audio CD, MC or LP), the following conditions can be checked, variables set and repairs made:

Are there cp entries? How many? (variable: copies)
Are object type standards met? Variable ot=object.ot; dt=object.dt; ot2=ot.left(2); Warning if ot2!='CD' and ot2!='MC' and ot2!='LP'; Warning if dt!=ot2; propose to repair dt=ot2
If tracks > 0, and ot2='CD': for i=1 to tracks, check if tracki.cdda appears exactly once;
If tracks > 0, and ot2='MC' or 'LP': check if all tracks are named [A|B ] nn:nn:nn
check standards for documents (esp. mimetypes)
count and check versions: variable leftVersions=total number of versions on left side, i.e. album's documents are on right side; same for rightVersions; are all versions (left and right) documents? minLeftVersions=minimum of left versions per track; same with maxLeftVersions, minRightVersions, maxRightVersions normal is minRightVersions=maxRightVersions=0, minLeftVersions=maxLeftVersions=0 or 1; if not, warn (QStringLists of tracks with 0,1,2 left/right versions?) variable: QString thisRelSide="2" if MP3/OGG/WAV etc.other-doc on left side thisRelSide="1" if MP3/OGG/WAV etc.other-doc on right side thisRelSide="" if inconsistent (none or both; count atype='n', sound-mimetypes only) thatRelSide="" if thisRelSide inconsistent, "1" if thisRelSide="2" etc.
Check other-docs (versions): which mimetypes? all the same? warn if not MP3 or ogg or wav count occurrences of mp3, ogg, wav, others
other-docs: standards met? (packer,license,atype etc.)
other-docs: one sound version of every track in activeMedium? of some tracks only? of none?
are versions named different from orig.? (just warn, may be correct)

Sheet music vs. performed music and other definitions

Most music is composed as opposed to improvised, that is, it is written down in some notation. Musicians can read this notation and perform the music according to it. This written-down music is called sheet music.

When musicians perform music, we speak of performed music. In our context, of course, it is not enough to perform it; the performed music must also be recorded in order to get a document. So when we speak of performed music, it is implied that it is also recorded music. This can be on the old vinyl records, music cassettes or audio CDs. All of them, including the digital audio CDs, are not optimal for archiving. Audio CDs have no real file system and the tracks take more space than necessary. Songs in compression formats like MP3 and Ogg take about 1/10 of the space of songs on audio CDs, with minimal, usually unhearable quality loss.

Sources for sheet music: http://www.gutenberg.org/music/ http://www.classicalarchives.com/ (half-free) http://www.mutopiaproject.org/

Formats: MusicXML, MIDI, Lilypond (.ly), Postscript, PDF, musixtex, Finale, and others

Programs: rosegarden buggy, lilypond limited to printing .ly (and MusixXML?) Better(?, but also not very usable): noteedit (buggy: hangs when told to play TSE3) note that there are also partial incompatibilities. MusicXML is not yet complete, several versions are around.

Gutenberg.org uses Finale and MusicXML formats see http://www.gutenberg.org/faq/F-2. MUS from Coda Music http://www.codamusic.com is a proprietary, closed format.

DVDs in Unibas

DVDs are difficult to treat for several reasons: They have a complex structure. They may have copy protection and it may be illegal to circumvent it.

Anyway, Unibas does not handle DVDs directly, but acts as a wrapper around other programs. So if it is legal for you to use these programs at all, it is also legal to use them thgough Unibas.

There are several ways to treat video DVDs in Unibas. Currently (2007), none of them can be said to be the best, every one has shortcomings. The ways are described in detail below, but is you just want a summary as of June 2007, here it is:

Once Matroska is stable, easy to use and fully functional, this seems the ideal way.
Till then, for good DVDs, creating ISO images of shrinked DVDs seems optimal. This preserves most (all useful) features and quality, and it is easy to create DVDs playing in standalone players from these ISOs.
For mediocre DVDs, an XVid copy (about 700 to 1400 MB per DVD) seems enough.

The content (document) of an original video DVD has the mimetype video/x-dvd (id: 122). It is a directory which contains the VIDEO_TS subdirectory. Its license usually is 13: "Copy Protected". This is due to the use of CSS. In some countries copying copy-protected DVDs (which is technically no problem) is prohibited.

A DVD backup is best handled in Unibas as an ISO file, mimetype video/x-video-dvd-image (id: 125). Programs like Ogle, xine, mplayer and VLC play these directly. Because the original is copy protected, the backup should also have the license 13: "Copy Protected".

If there is no copy protection, usually the "Fair use" license 4: "Text, sound and video private user license" applies, which allows private copying ("fair use"). This is the case for vinyls, VHS cassettes, DVDs without CSS and others.

Criteria to choose a copy format:

Playability in standalone DVD player: Ideally, one should be able to use an archive DVD directly in a modern standalone DVD player.
Quality: There is a tradeoff between quality and size, but compressed files with 2 GB often have acceptable quality.
Size: Obviously, this is an important criterium for archiving.
Openness of the format and software: You want to playback the video even after years or decades; if a format is only supported by one application, chances are your videos are useless if this one application is no longer supported and doesn't run on a modern system.
Support: The format should be supported by a broad range of hardware and software, and future support should be ensured. Often support correlates with openness, but not always, e.g. MP3 is better supported but less open than OGG.
Archivability: Ease of management, e.g. copying to a new medium. Physical media have a limited life span, so documents must be copied to new media. As there are many documents that need to be archived, good archivability means a document must not occupy a medium alone, but can be stored on a medium with many other documents. For Unibas, ideally, one document is one file.
Completeness: DVDs contain more than just the movie: extras, making-of, but also commercials, FBI warnings and user restrictions. Interestingly, decreasing completeness can increase quality of the copy by removing annoying features. For example you usually don't want annoying FBI warnings, commercials, and user restrictions. However, it can be useful to have the complete set of spoken and subtitle languages, and even the making-of can be interesting.
Metatags: The format should allow to include meta information (title, playlength, summary, actors, ...).

ratDVD: Once a promising format, but no longer developed, i.e. no future support, not open, Windows-only.
Xvid: Good in size (quality/size ratio), not so good in support by standalone players.
ISO file of original DVD: The DVD is taken and ripped (without recoding or shrinking) to harddisk. An ISO image is created, as if burning to a (9 GB) DVD. However, the ISO file is archived as-is, without burning. The result is highest-quality, complete, compatible, but takes unnecessary much space on HD and a 9 GB DVD to burn. Another disadvantage is that annoying features mentioned above are still there.
ISO file of DVD copy with DVDShrink: This is a very popular alternative. Those who view DVD in a standalone player will prefer this. On Windows, this copy can easily be created with anyDVD and DVDShrink, and burned directly using a free Nero version. Quality is good, completeness as you choose in anyDVD and DVDShrink (annoying features can be left out) However, there is no meta-information. To be written to a DVD, the ISO file must be less that 4 GB (this is a restriction that applies to all files burned on DVD, not specially video files). Bigger ISO files can be stored on HDs.
Matroska: A promising free format, but (2009) there is no easy way to create Matroska files from DVDs.

With (original) DVDs, the usual paradigm of "one document = one file" doesn't hold. Although the CSS and other restrictions are mainly overcome, a DVD still is a directory structure that can't be copied easily.

ISO image of DVDShrink'ed DVD

One possible format for archiving DVDs is an ISODVD file, of preferably less than 4 GB. This is the limit of files that can be stored on DVDs as a single file. ISODVD files can be archived in Unibas without restrictions, and can be played back directly (without mounting) by several programs.

Digitizing analog and semi-digital media

Often you have documents on analog media (books, MCs, LPs, VHS) and want to digitize them (and at the same time, catalog them in the DB). Here are some tips how to do this effectively. This partly also applies to "semi-digital" media, i.e. digital media without filesystems. These media include: audio CDs, DVDs, backup tapes, mini-DV cassettes for DV cameras and so on. Their content can only be accessed via special software (rippers, backup/restore programs etc.).

In the following we use MCs for example, but the same holds for other media types as well. It is suggested that you first examine all MCs whether they are worth being digitized or not. For those MCs you are quite sure you don't want to digitize them, mark them with a strikethrough "D". Those that are already digitized, will be marked with a normal "D". Those where you are unsure, should be marked with a "?". Later, after making up your mind, you can still mark them with a "D" or strikethrough "D" after the "?".

Also, it is useful to make sure at this point that Unibas knows where the MCs belong, i.e. check whether the roomids of the MC objects are correct. Suggestions how to do this are given in the Rooms section.

After making sure Unibas knows where to put the MCs after digitizing, take some MCs to digitize them. This can be done e.g. with the free program Audacity (see below). To set the input and output levels, use e.g. KMix. For output, select Master and set its level. For input, select Line and set its level. To listen to recorded sound, select PCM. If necessary, select the right Dolby setting. You can record, edit and convert sound with Audacity; to record only, krecord is simpler. If the songs on the MC are separated by pauses of at least one second, you can see these and split the one big track per side in several songs. Save the songs as A01.wav, A02.wav etc. for side A, B01.wav, B02.wav etc. for side B. Note the leading zeroes, they are important.

It is recommended to create a directory to put the digitized tracks in, e.g. /unibas_archive/raw/soundrec/MC. Make a subdirectory per MC, e.g. oid12345 where 12345 is the object ID of the MC. In this subdirectory, put the songs A01.wav etc. created with audacity as described above.

Check if songs are identical with what is written on the cover. If necessary, correct cover or insert new correct cover before saving as A01.wav etc.; afterwards the order and names of the songs must be correct when checking with DB ("select * from normalizeMusiCass(objectid, minutes);" in psql). Compare its output with cover and correct DB if necessary (before running bash/fromcass).

To insert songs not yet in DB, use the mcsong bash tool. A better, GUI-based tool is under development (TODO). See cp specs for what the cp.file field holds in case of MCs.

If the MC songs are digitized and saved as .wav files, and the information about the single songs is correct, call bash/fromcass to convert them to MP3, insert id3 tags, and make the DB entries for the MP3 files. Use the (TODO) tool to make sure the digitized and MP3-converted songs are there and the DB entries are correct. Then you can delete the wav files and mark the MC with the "D".

Digitized material that needs to be archived, and maybe preprocessed before, is stored in the subdirectory raw/ of the archivePath (which usually is /unibas_archive/ or /home/user/unibas_archive/.

Under raw/, ripped CDs go in rippedcds/, DVDs as directory structured go in dvdstruc/. Digitized MCs, LPs etc. have their place in MC/, LP/.

Below those subdirectories, the structure is as follows:

The subdirectory contains albums (e.g. video DVD structures) and packages. A direct subdirectory therein corresponds either to a single album (e.g. a video DVD directory structure, a directory with MC or CD tracks) or a subdirectory (package) that contains several such albums.

If a direct subdirectory has only files except perhaps for a VIDEO_TS or video_ts directory, then it is expected to be a single album. The rules for album directory names apply.

Otherwise it is expected to represent a package. Each directory below is expected to represent a single album and will be treated as such. These albums in packages are expected to be ordered alphabetically.

Album and package names consist of an ID or a title or both, separated by a space. The ID is one of:

ean[0-9X]+: The text 'ean' followed by an EAN (or UPC)
oid[0-9]+: The text 'oid' followed by an existing Unibas object ID
pid[0-9]+: The text 'pid' followed by an existing Unibas product ID
isxn[0-9X]+: The text 'isxn' followed by an ISBN or similar number

The numbers directly correspond to product.ean, object.id, product.id, and product.isxn, respectively.

The title corresponds to product.descr, object.descr and, if the album has just one document, document.title.

Names of directories and files in a DVD structure may be lowercase and will be uppercased.

Names of albums in packages may also be just "ot 1", "ot 2" etc., where ot stands for any product.ot like CD, DVD, etc., and the name just indicates the alphabetical order. The numbers may begin with a "0" in case of 2-digit numbers. These names will be appended to the package name to give the album name inside the package.

File names consist of a sequence indicator or a title or both, separated by a space. The sequence indicator may be:

a one- or two-digit number
'A' or 'B' followed by a one- or two-digit number (only for 2-sided media)
"tracknn.*" where nn is a two-digit number

The 'A' or 'B' will be used to denote the sides of two-side media. Apart from that, only the order of a sequence indicator matters.

A name, if there is, can be taken as document title. CD track originals, however, get the name "tracknn.cdda", see Audio CDs, MCs, LPs and their tracks.

Digitizing and editing audio data with KRecord and Audacity

Digitizing and saving audio data with Audacity This is a short Howto about digitizing and saving audio data from MCs, LPs etc. with Audacity (http://audacity.sourceforge.net/). This is by no means a replacement for the original documentation of Audacity which can be found at http://audacity.sourceforge.net/help/documentation.

First you need to connect your cassette or vinyl player (or amplifier's headphone jack) to the computer's Line In jack, and set recording and playback levels with a tool like KMix. The details depend on your hardware and software.

Start Audacity and click on the red Record button. Start playback of the cassette or vinyl record (side A). When side A is finished, turn the MC or LP without stopping recording. The few seconds silence you are recording between the sides can be easily cut away later.

After the complete MC or LP is finished, stop recording. Mark the silence at the beginning and delete it. Mark the first song and select File->Export selection. Save it to locations as discussed above. Audacity supports the usual sound formats directly, for example MP3. For other formats, you need to save as WAV and convert afterwards using a special converter program.

Then delete the selected and saved song, mark and delete the following silence, and proceed with the next song.

Capturing videos from DV cameras

Capturing videos from DV cameras is usually done via a firewire cable. While there also exist GUI applications to do this, this section describes a simple way using dvgrab, a command-line application. Dvgrab (with the right options) grabs the video via firewire cable, splits it in scenes and marks the scenes with the original timecode from the camera. The options are:

-a, -autosplit Try to detect whenever a new recording starts, and store it into a separate file. -t, -timestamp Put information on date and time of recording into file name.

The recommended directory to put these files is:

/unibas_archive/raw/homevideo

So cd into this directory, switch on the camera in player mode (but don't press the Play button) and run:

% dvgrab -a -t

The directory will be filled with files of the form:

dvgrab-2008.03.24_09-41-07.avi

where 2008.03.24 is the date and 09-41-07 the time.

In Debian, it may be necessary to add the user to the disk group:

usermod -a -G disk user

DV is not supported by many programs and takes a lot of disk space. To convert the videos to the compressed and more widely supported MPG format, do: Tools - Convert DV to MPG

Ideas for future features

Documents that might be of interest to persons: Find a way to describe classes of documents that might interest persons, e.g. persid, crit where crit is a criterion (SQL WHERE expression) that says which this person might be interested in.

Keywords associated with document as in abstracts: take notion id in DB, notion as keyword? More precise and smaller that real words.

document must be able to store all LSM info (see sunsite)

to release Unibas: create package: program and public data (select data that is public, e.g. famous persons) so person and company info, too needs public marker

plausibility checks: authors create documents (usually) in their lifetime etc. authors are usually public, friends (invited to parties) private

express contents type and tech. type in terms of notion id, too?

Supported document formats

Generally, Unibas supports open standard formats. However, strictly speaking, it is not Unibas that supports them but the many helper programs unibas relies on.

For example, Unibas needs lame and mpg321 to encode and decode MP3. For metadata extraction from files, Unibas uses the libextractor library (http://gnunet.org/libextractor/). In general, for any free format there is at least one free program that handles the format.

This program can be a library, a command-line tool or a GUI program. In case of a library (preferrably with Python bindings) or a command-line tool, Unibas works best; the user normally doesn't even notice that another program runs underneath.

Some tasks, however, can only be done with GUI programs. To shrink a DVD, for example, there is only one satisfying solution yet, namely the GUI program k9copy. (Copy a DVD entirely in full length is easier and can be done with dvdbackup.)

Without attempt at completeness, here are the most popular formats Unibas supports. For details see the subsections.

jpg
gif
TIFF
MP3
WAV
Audio CDs
Video DVDs (may need external GUI programs, see above)
Quicktime
MPEG2
XVid
PDF
epub
MIDI
HTML
Project Gutenberg Ebooks
OpenDocument Text
OpenDocument Spreadsheet

Unibas knows (in the mimetype table) almost all mimetypes, but that does not mean that Unibas supports them all in the sense of extracting metadata, converting to similar formats etc.

Photos with EXIF information

Videos from DV cameras

Places and events

Places have names, geographical coordinates, maybe population number, maybe an area size, administrative data, etc.

Events are things like concerts, demonstrations, battles, conferences, etc., but also private parties and trips (not longer travels). They are useful to categorize photos, videos and other documents.

Places

Places are geographical locations on the Earth. (Once space travels become usual, we will have to deal with how to denote places outside of the Earth, and extend our model. But for now, there is no necessity for this.)

A place has an id, a native name, an English name, an area and coordinates of the center. This may be an administrative (e.g. a capital in case of a country) or geographical center. Coordinates are given as latitude (lat), longitude (lon) and elevation (ele). The area is given in square meters (float).

Optionally, for bigger areas (e.g. countries), the northernmost and southernmost latitudes (nlat and slat), and the westernmost and easternmost longitudes (wlon and elon) can be given. With them, a rectangular frame is given which contains the area of the place.

Places are geographical. Things like France including the départements outre-mer is a political construction from colonialism, not a geographical place. France mainland only (the hexagone), on the other hand, is a geographical place. One might argue whether or not it should contain nearby islands like Corsica; anyway, Corsica is a place in itself and the table has a partof field which contains the id of a place this place is part of (e.g. Texas is part of the USA).

In many cases, political divisions are unlogical geographically and historically (colonialism etc.). It is not yet clear how to deal with these cases. However, we will try to be compatible with OSM, Wikipedia and other free and public sources.

Area and population are meant including (summarizing) all parts of a place.

Concerning the precision for lat, lon etc., real (at least 6 decimal digits) should do. That gives us 2 digits left and at least 4 digits right of the comma, hence a precision of at least 1/10000 degree.

360 degrees = 40000 km 1 degree = 111 km 1/10000 degree = 11.1 m

And 11 m is the order of magnitude of a GPS receiver and of the smallest places to be stored in this table, namely small houses.

TODO: As text or as notion id, the type field denotes what type of place it is: a house, village, town, country, ...

Places sind geographische oder verwaltungsmäßige Einheiten. Z.B. Taunus ist geographisch und daher unveränderlich für menschliche Verhältnisse (ändert sich vielleicht in Millionn Jahren). Rheinland-Pfalz ist verwaltungsmäßig (politisch) und daher historisch wandelbar. D.h. (politische) Places haben Beginn und Ende, dauern aber meist deutlich länger als Events. Obwohl z.B. der hundertjährige Krieg (1337 bis 1453) länger dauerte als die DDR.)

Events

An event is a gathering of people performing actions or, more general, a set of happenings or actions by one or more actors (human or not), belonging together by topic and in a given temporal window.

Events are things like concerts, demonstrations, battles, conferences, etc., but also private parties (then marked as private) and trips (not longer travels). They are useful to categorize photos, videos and other documents.

Greater events (e.g. longer journeys) can be subdivided into smaller ones (e.g. one-day trips) temporally.

An event has an id, name, starttime and endtime (local time), place id (if appropriate) and location (place name in case there is no appripriate place DB entry).

An event can also have a description, but of usually no more than about 1000 characters. For longer texts, a node should be created and the event linked to that node.

Although usually an event is short, no longer than a day, some longer processes also have characteristics of an event. For example, one would like to group all photos about the building of one's home. It is then useful to declare this process, the building of one's home, as an event which then can last several months.

This example also makes clear that not all photos taken in a certain time span necessarily belong to an event that happens to lie in that time span. A different event could take place in parallel.

If a photo or video is taken at an event, we can say that it "shows" (part of) the event, in the same way as a photo shows a person or thing. If necessary, we might later distinguish between "taken at (event)" and "show (person)", but currently it is the same notion (id 53225).

Infosources

Infosources are external Internet-based sources of information for Unibas. These can be search machines, retailer shops of books, CDs and DVDs, open content sites such as Wikipedia, sites of newspapers and magazines, libraries, and others. As you see, they can be quite diverse.

Originally intended for human readers, infosources can also be queried automatically. Examples are: Search machines, DVD and book stores, open sites like Wikipedia, online TV guides, databases like Musicbrains, etc.

Contents and presentation vary, but some patterns occur in most infosources. Usually the user types a keyword to search for. The results are presented as short items of a list which can go over several pages. The results in a page normally are below each other or, as in kimasmarket.com, there can be 2 or more list items in a row. One important function is to read out these items only, another is to follow the detail links and read out detail information.

As infosources are intended for humans, preparations are needed to query them and the results must be processed to suit a database format. Although certain patterns reoccur, details vary between individual infosources and some adaptation information is needed to query infosources as they individually require it. Results carry individual flavors, too, and are best cached in HTML files and intermediate tables before storing them in Unibas.

The tables are:

infosource: only the attributes most infosources have in common
infosourcefeatures: the not-so-common features
infosourcecontent: intermediate storage for all types of infosource query results; one entry per query result
infosourcedetail: intermediate storage for query results, one entry per attribute-value pair; references infosourcecontent

Optionally, downloaded pages can be cached in /unibas_archive/infosources.

The infosource table lists infosources with the attributes most of them have in common (id, name, type, url, charset etc., for details see below). Infosources can be queried in advance, without a concrete reason, to gather data. Or they can be queried if you have a concrete medium, e.g. a book, you need the metadata of. You could read them from the book and type them in, but Unibas can do that for you. Give Unibas the title or, better, the EAN or ISBN, and Unibas fetches the complete metadata (author, year, publisher, summary...) from one or more Internet book sellers.

How does this work? Suppose you have a copy of Harry Potter and the Goblet of Fire by J K Rowling. Let's suppose there is an Internet bookseller named Booksell.tld. You open their site in your browser and type "potter goblet" in the search field. After clicking "find" a list appears. If it goes over several pages, you can also select a page or click the "next >" button.

Very likely your book is among the first in the list. Click on its item to see details. A detail page appears that, at its bottom, has a "title info" box which contains a summary, product format, publisher's name (also called "label") and ISBN number. Compare this with the one printed below the barcode of your copy. If it is the same, then certainly the other data also will be the same. You can copy and paste them into your own catalog.

What you can do manually, Unibas can do automatically for you. See Internet bookseller page anatomy analyzed for Unibas for details.

Types of infosources

Search machines: Usually only for finding other, more specialized infosources automatically
Encyclopaedias, databases, dictionaries
Newspapers and magazines: It makes sense to archive them decentrally. At least metadata and links to originaly can be extracted. Furthermore, sometimes articles are censored after their release (see http://www.zeit.de/digital/2009-10/online-archive-zerloechert). In these cases, it is useful to have the original version archived.
Booksellers: These usually also sell DVDs, CDs etc.
Link lists (such as dmoz.org): Like search machines, mainly useful to find other infosources.
Thematic portals
Literature collections: There are many collections that provide free (like Gutenberg.org) or partly free documents and metadata.
Official lists like buchhandel.bvdep.com
National libraries

Internet bookseller page anatomy analyzed for Unibas

Let's reconsider the Harry Potter book example and look under the hood, that is, in the HTML source code. You need some basic knowledge of HTML to understand this.

The homepage of our fictive bookseller Booksell.tld presents a lot of advertisements but also something useful: A search field where you can type in a keyword, e.g. a title, beside a combobox where you can select a category, e.g. book, CD ord DVD. Booksell.tld also has "expert search" capability (not all booksellers have this) but we will first concentrate on the simpler "quicksearch" feature.

Here you only type a keyword and select a category. The HTML code of the form, somewhat simplified, looks like this:

<form action="http://www.booksell.tld/quicksearch/" method="post">
<div id="searchbox"><span>Search</span>
<input type="text" name="searchtext" value="" />
<select name="category">
<option value="">all categories</option>
<option value="opt_book">book</option>
<option value="opt_dvd">DVD</option>
</select>
<input type="submit" class="submit_img" value="" alt="Go!" />
</form>

The input to this form is sent to the URL in the first line: http://www.booksell.tld/quicksearch/. To transfer these data, the POST method is used. Data are of the form "attribute=value"; these pairs are concatenated using ampersands ("&") by the browser.

The keyword you type goes in the searchtext field; value="" means that it is initially empty. The category you select goes in the category field; if you select what is presented as "book", the value used internally is "opt_book"; if you select what is presented as "DVD", the value used internally is "opt_dvd". So if you type the keyword "Gulliver", select "book" and click on the "Submit" button, what is sent via POST to the server is:

searchtext=Gulliver&category=opt_book

The advanced search form has many more options and the input is sent to a different URL, but in principle it works in a similar way.

Instead of typing these search values in a browser, Unibas can generate them and retrieve the same answer you would get after submitting a search manually.

The information about where to send a query, which method to use (POST or GET), how to build the POST data in case this method is used, and so on, is stored in the infosource and infosourcefeatures tables.

The answer is usually structured as follows: Usually several items will be found for your query. (If there is none, you will see some text like "No result". If there is exactly one, it is often displayed directly in detail.) These items are displayed in one or more pages of usually 10 items each.

From such a list page, you can navigate to the next and previous page (if any). There is also a line such as "page 1 of 123" so you know how many pages there are in total. And you can click on a list item to see details about it.

What appears then is a detail page which has a different structure than a list page. It usually shows the title, cover image, summary and other details about the selected article. What details are displayed varies from site to site; unfortunately the EAN is only displayed by few sites. The EAN (or UPC in North America, or ISBN for books) is unique per product and therefore important to identify a product, e.g. if you have information about it from different sources.

Now humans can easily read a list or detail page, but what about Unibas? Can it interpret the data in these pages?

It can. The fields' meanings can be recognized by tags in their vicinity. For each field to be read, Unibas needs a regular expression (regexp). If these regexps are provided, Unibas can make sense of both list and detail fields. That means, Unibas "knows" how many list pages there are, Unibas can navigate through these pages, Unibas can find and follow the links to the detail pages, can read the detail information and even download the cover images.

However, for this to work, Unibas needs a lot of information about the structure of the bookseller's site, mostly in form of regexps, stored in the infosourcefeatures table. These are somewhat tedious to write but this can partly be automatized. The idea is to select an attribute's value in an example page, and let Unibas determine the regexp around this value itself from the context.

Infosource-related tables

This table shows the relations of the infosource-related tables:
Table 1. Infosource-related tables
Cardinality Infosource Result
One per item infosource infosourcecontent
Many per item infosourcefeatures infosourcedetail
In the early stages of development, there was only the infosource table. It had a lot of columns and as new types of infosources were explored, more and more columns were added. At the same time, it became clear that some columns were only needed by certain types of infosources. So to gain more flexibility with the program and data, the old fat table was split into two slim ones. The new infosource table holds only the columns needed by almost all infosources. Most columns of the the old infosource table were renamed optional "features", and put into the newly created infosourcefeatures table. It contains the infosource id, the feature name and the feature value.

Table 1. Infosource-related tables
Cardinality	Infosource	Result
One per item	infosource	infosourcecontent
Many per item	infosourcefeatures	infosourcedetail

So these two tables represent the infosource itself. When Unibas reads from an infosource, the result initially is in a format quite different from Unibas' own format. Before converting it to suit the Unibas core tables, it is useful to intermediately store it in its original format, but split up in different attributes.

That is what the other two tables are for: To store the result of querying a page. In infosourcecontent, one row represents the result of one page. In infosourcedetail, which refers to infosourcecontent, one row represents one attribute, e.g. the title, the EAN, the list of actors of a movie (which only later is split up in the individual actors), and so on. The set of attributes naturally varies between infosource types.

Specifications for the infosource table

This table contains one entry per infosource.

Specifications for infosource table.

id: Unique id
name: Human-readable name
type: Type like 'book retailer', 'newspaper', 'weekly magazine', 'library', 'database', 'encyclopaedia', 'search machine' etc.; machine-readable, but not yet used programmatically
url: URL of the starting (home) page
charset: Charset from a meta-tag in the HTML header; usually 'utf-8' or 'UTF-8'
itemsperpage: Max. number of items in a list page
re_nextpage: Regular expression to find link to the following page
re_error: Regular expression to detect an error message in the page; usually contains keywords such as: sorry, unfortunately, no articles, not find, any results, no such DVDs

Further information about infosources largely depends on the type and is stored in the infosourcefeatures table.

The infosourcefeatures table

Specifications for infosourcefeatures table.

id: references infosource
feature: name of the feature
value: value of the feature, for example: regular expression

Features and their meanings. In the following, the various features and their meaning wll be listed, with hints at the corresponding functions in infosource.py.

A list page (usually the result of a broad search) consists of a prefix, one or more elements, and a suffix. An element contains a very short description of an item found, and a link to a detail page about that item.

The following are regular expressions that control the behavior of functions dealing with list pages, especially getListElement() and parseElement().

ignoreBefore: what matches and everything before will be ignored from a list page
re_elementbegin: finds the begin of an element in a list
re_elementend: finds the end of an element in a list
re_e_detaillink: finds the link to the detail page
re_e_lsin: find the LSIN in an element
re_e_mainfield: finds the main field (title) in an element

detaillink_pattern is simply %(link)s if the link is an absolute one, or something like http://www.booksell.tld/%(link)s if it is a relative one. With the re_e_detaillink regexp, the link is found. In getListElementDetail(), %(link)s in the detaillink_pattern is replaced by the link to get the URL of the detail page.

A pattern in this and some other features is a string with a placeholder. The placeholder, such as %(link)s or simply %s, is replaced by another string to give a complete URL.

The ean_url_pattern is a pattern to be instantiated with the EAN to give the complete, direct URL of a detail page (or in some cases of a one-element list page).

Similarly, the lsin_url_pattern is a pattern to be instantiated with the LSIN to give the direct URL of a detail page. LSIN is the infosource-specific Local Standard Identification Number.

Most Internet booksellers ofer a "quick search", where you just type the search word, maybe select a category and click "Search", and an "expert search", where you can search by title, author, actors, and other fields. expertsearch_url is the URL for this expert search. The search parameters are transferred with the HTTP POST method in the getPageByEanPost() function.

Some booksellers and their partners use only their LSIN internally, but provide means to find the LSIN for a given EAN and vice versa. They provide a "finder" page where you can enter the EAN and click on a button to find the LSIN. The finder_page_ean_regexp, finder_page_postdata and finder_page_url features are for automatically finding the LSIN via this finder page.

In these infosources, the EAN is not on the detail page, but on an extra seller page. For such infosources, there is the 'seller_page_url' feature which contains a pattern such as 'http://www.allbooks.tld/gp/seller/search.html?keywords=%(lsin)s'. This is to be completed with a value, in this case the lsin, from the dic. From the page got this way, the regexp seller_page_ean_regexp extracts the desired EAN.

The detail page usually contains a small image of the DVD cover. If you click on that, a bigger image appears. The regexp for the link to this bigger image is re_d_image. On some infosources, however, there is no link to an image, but to an image page which contains the (bigger) image. In this case, re_d_imagepage is the regexp of the link to that page, and re_imagepage_image is the regexp of the link in that page to the jpg file

finder_page_url is the URL of this page, finder_page_postdata is the pattern to instantiate with the EAN and send via POST method. In the page thus retrieved, the regular expression finder_page_ean_regexp yields the desired LSIN. This is done in getLsinFromEan().

The titlesearch* features are used for searching by title. The titlesearchurl feature is an URL or URL pattern which contains the %(title_quoted)s placeholder. If the titlesearchattr feature has no value, the latter is assumed. In this case, the GET method is used, by instantiating the pattern with the quoted title.

If the titlesearchattr feature has a value, POST method is used. From the values of titlesearchattr, titlesearchcat and titlesearchcatattr, a POST expression is built and sent to the URL. This is done in the getListPageByTitle() function.

In expert search forms, the user can fill in some fields. These fields correspond to POST attributes. The search_field_dic feature is a Python dictionary that related Unibas field names to these POST attributes. Example: {"ean":"form[product_number]"} That means, to search for an EAN, the field labeled "form[product_number]" must be filled. The function getPageByEanPost() gets apage by EAN via POST method this way.

re_totalelements is a regexp to find the total number of items a search has yielded.

The re_d_ features are regular expressions to find data in the detail pages.

re_d_actors: actors
re_d_authors: authors
re_d_category: product category, such as DVD or book
re_d_cutting: persons who did the cutting of a film
re_d_date: publishing date of a DVD, CD or book
re_d_directors: directors
re_d_distributor: distributing company
re_d_ean: EAN
re_d_features: features of a DVD
re_d_film_music_by: film music composer(s)
re_d_genre: genre
re_d_image: front cover image
re_d_isxn: ISBN or ISSN number
re_d_languages: (audio, synchronized) languages of a DVD
re_d_length: playlength
re_d_lsin: LSIN
re_d_mediacount: number of media (e.g. 3 CDs in a box)
re_d_ot: object type, i.e. DVD, CD, book, ...
re_d_producer: producing company of a film, publisher of a book
re_d_rating: rating that tells for which age a movie is recommended
re_d_scriptwriters: scriptwriters
re_d_speakers: speakers of an audio book
re_d_subtitles: subtitle languages of a DVD
re_d_summary: some lines of text describing the contents
re_d_title: title
re_d_tracks: list of tracks of a CD

These many data are often presented in a table or list. The rows then are identical up to the attribute and its value. In this case, this row scheme is stored in the main_data_scheme feature, and the attributes are listed in the scheme_field_dic feature. Thus the features are easier to maintain.

The scheme might look like this:

^.*<li><b>%s:</b> (.*?)</li>.*$

and the scheme_field_dic:

{"film_music_by":"Music","summary":"Synopsis"}

From this, the following features are generated in memory (i.e., temporarily, not stored in the database):

re_d_film_music_by = "^.*<li><b>Music:</b> (.*?)</li>.*$"
re_d_summary = "^.*<li><b>Synopsis:</b> (.*?)</li>.*$"

This is done in readdb(), the function that also reads in the "normal" features.

The infosourcecontent table

As an intermediate storage, before reformatting the results to suit Unibas conventions, the infosourcecontent table is used.

Like the infosource table, the infosourcecontent table can be used for all infosource types.

Some infosources such as movie databases provide several entities, namely titles, persons (actors etc.), fictional characters, companies (e.g. producers). This is OK in Unibas due to the entityid in infosourcecontent.

This table serves as:

intermediate storage for essential parts of information from infosources
link between records of different infosources to Unibas records

Specifications for infosourcecontent table.

sourceid: Id of the infosource this data item is from, refers to infosource table
entityid: Id of the entity this item corresponds to, e.g. 23 for product in case of a DVD
mainfield: Content that would be stored in the mainfield, e.g. descr content for product
tostring: More detailed content, e.g. complete name of a person (mainfield is only famname)
publicid: value of e.g. EAN of a product
lsin: Infosource's reference number, e.g. artno for product
detaillink: Link to detail information
readdate: Date/time when data was read out
instanceid: Id in the corresponding Unibas table, e.g. product.id

Let's take as an example the DVD version of the movie "Shrek 3 - Der Dritte" (German version). Let's suppose there is a fictional Internet DVD retailer named Localdvdstore.com which as an infosource has the ID 128. So the sourceid of the Shrek3 entry will be 128. Entityid will be 23 (the ID of the product entity type), as we are talking about a product, a DVD. Instanceid is the product ID if we also have a corresponding entry in the product table. Mainfield is "Shrek 3 - Der Dritte"; tostring may be left empty in this case. Publicid is the EAN of this DVD, here 4047552500226. Lsin (for Local Standard Identification Number) is the article number used by the store, numeric or alphanumeric, e.g. B000QEKMVC. Detaillink is an URL which takes you directly to the article's detail description on Localdvdstore.com's site, bypassing the front page. All important detail information can be read from that page, and it might be useful to store it locally, in case it disappears from the site some day. But usually it suffices to extract the detail information and store only that. Readdate stores the system date/time the page was read by Unibas.

The infosourcedetail table

Specifications for infosourcedetail table. This table contains type-specific data as attribute-value pairs.

infosourcecontentid: references infosourcecontent
attr: attribute name, e.g. "title" or "EAN"
value: value, e.g. title or EAN or summary text

How to find out infosource features

Finding out infosource properties usually is so straightforward that it should even be possible to automatize it. The Unibas development team will surely try to automatize the task, but in the meantime it must be done by hand.

In addition to this Howto, you need some basic knowledge about HTML and regular expressions.

First you create a new infosource record. As a preparation, open the new infosource's URL in a browser. Also open the HTML source in an editor using the browser's View - Page Source function (or similar one). The browser then shows you what a page looks like, and the editor shows how it is made. You need both views.

The name and url can be the same, e.g. "booksell.tld", but the name can be more human-readable, e.g. "BookSell". The type field is currently not used, but you should name types consistently. To find out the charset, search for "charset" in the source. Its value is usually utf-8 ot UTF-8 or similar. This task can easily be automatized.

To get the re_error value, try to provoque an error message, by searching for nonsense (like "ZTGZTRKMNH"); usually a message appears with one or more of the keywords:

sorry
unfortunately
no articles
not find
any results
no such DVDs

or in other languages, their translations such as in German:

leider
keine Ergebnisse
0 Ergebnisse
keinem Ergebnis
keine Artikel
nicht gefunden

Select one such phrase (from the HTML page, not from the list above; these are only examples) and put it in the re_error field.

Next you need a not-too-short list of results. Search for a title keyword that is likely to occur frequently, like "love". You should get a result list that extends over several pages, and you see only the first of these pages. (Make sure the list really extends over several pages.) Count the items in this first page, and enter its value in itemsperpage. Find out how to get to the second page, and from there to the next and so on. If you have it, open the source and try to find that link in the source. It may be labeled "next", or show the image of an arrow, or the number of the next page, so you can't go after its content but its context. As it is a HTML link, it has the form:

<a .*href="(.*?)"

where <a stands for the HTML anchor tag, .* for arbitrary text in-between, and href for "hypertext reference". The part in parentheses is the link you need. .* again stands for arbitrary text, and the ? after it tells the regular expression engine to find the shortest possible match (otherwise, the longest possible match would be found).

Here is a concrete example:

href="(.*?)"s*class="nextPage"

Here, the <a is not necessary, because thanks to class="nextPage" this regular expression matches at exaxtly the right place.

In general, you will want to make the regular expression as short as possible, but long enough to be unique, that is to match at only one place.

Creating infosourcefeatures records.

Note the id of the new infosource record. You will need it in all infosourcefeatures records.

In the HTML result list page, select an item, e.g. a DVD, and click on the link to the detail page. The detail page has a different structure that the list page. It is only about one item, and provides much more data about it. Find the list of product details. It usually looks like:

EAN: 123123123
Genre: family
Synopsis: (some text)
Rating: PG
Length: 99 minutes
Article type: DVD

Of course, attributes and values will be different, but it is important that you recognize some sort of table or list where the main bulk of data is presented. Now find this in the source code. Source code is hard to read, but find some text from the table, like "Article type". Let's suppose you find it in this context:

<li>Article type: DVD</li>

That is an item of a HTML list. Now let's translate that into an infosourcefeature. The article type is, in Unibas, a product category. The value we are interested in, comes after the colon and space and extends up to the </li> tag. That is the part to put in parentheses so it will be returned by the match() function. Spaces in HTML have usually no meaning for the rendered output. So instead of one space, there could be more, or even linefeeds, without changing the output. To account for this, it is useful not to write a literal space in the regular expression, but something that matches an arbitrary number of whitespace (including linefeeds):

s*

Hence our complete regular expression looks like this:

<li>Article type:s*(.*?)</li>

If you enter this in the psql PostgreSQL client, the INSERT statement will look like this:

INSERT INTO infosourcefeatures (id,feature,value) VALUES (100,'re_d_category',E'<li>Article type:\s*(.*?)</li>');

Note the double backslash which is translated into a single one, and the E before this string with backslashes. See the PostgreSQL documentation for more about this, and verify with a SELECT statement that the string is interpreted correctly:

SELECT * FROM infosourcefeatures where id=100;

Maybe not all attributes occur in this list. Sometimes the title, for example, can be recognized by the HTML header tag <h1>:

<h1 class="...">Golden Hits</h1>

The corresponding INSERT statement is:

INSERT INTO infosourcefeatures (id,feature,value) VALUES (102,'re_d_title','<h1[^>]*>(.*?)</h1>');

Note that between <h1 and the closing > there may be some text which we want to ignore. To do so, we insert: [^>]* which matches any number of characters except >.

Look at product details list again. They all look the same except for the attribute and the value. Also look at the source of this list. If the scheme of the items is really the same, then you can save some work by entering this scheme as main_data_scheme feature:

INSERT INTO infosourcefeatures (id,feature,value) VALUES (100,'main_data_scheme',E'<li>%s:\s*(.*?)</li>');

The corresponding scheme_field_dic is: {"ean":"EAN","genre":"Genre","summary":"Synopsis","rating":"Rating","length":"Length","category":"Article type"}

Some more examples and tips.

This source text snippet:

<span class="role">Script: </span>LaGravenese, Richard; <

translates into the following infosourcefeatures INSERT statement:

INSERT INTO infosourcefeatures (id,feature,value) VALUES (102,'re_d_scriptwriters',E'<span class="role">Script:\s*</span>(.*?)<');

Metadata usually come in tables or lists of various forms. Here is a row of a HTML table::

<tr>
<td align=right valign=top>Year:</td>
<td valign=top>1972</td>
</tr>

Don't rely on linefeeds. They have no meaning in HTML. Consequently, they may be there in abundance, as above, or completely missing. Here we leave out the <tr> and </tr> lines completely, take only what is necessary from the middle lines, and replace the linefeed with a s*:

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (140,'re_d_year','top>Year:</td>\s*<td valign=top>(.*?)</td>');

If you use such a short regular expression, remember to always verify that nothing else matches.

Here is an example of missing linefeeds. Normally, to make the source more (human-)readable, you would insert a linefeed between </li> and <li...> but for HTML, this is not necessary.

<li class="ReleaseDate">DVD Release Date: 09/04/2007</li><li class="dvdRating">Rating: <img border="0" src="http://images...

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (104,'re_d_date','<li class="ReleaseDate">DVD Release Date:\s*(.*?)</li>');

Here is an item of a list:

<li>EAN: 4029758642788</li>

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (102,'re_d_ean','<li>EAN: (.*?)</li>');

Another example:

<li class="ean"><span class="name">EAN:</span>4010324021878</li>

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (106,'re_d_ean','<li class="ean"><span class="name">EAN:</span>(.*?)</li>');

These examples just show different tags are used in HTML to tune the optical appearance of the page rendered in a browser. For Unibas, this has no meaning but the syntax must exactly be followed in the regexp so it matches.

<li><b>Label:</b> Penninx</li>

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (128,'re_d_producer','<li><b>Label:</b> *(.*?)</li>');

Sometimes there is so much trash in tag properties you have to cut it away, to keep the regexps readable and because you don't know if it's really always the same trash. Imagine some detail in the div changes, such as the color, and your regexp does not match any longer:

<div style="padding-left: 5px; border-top: 1px solid #B0B0B0;" class="black10"><b>EAN-Code:</b></div>
</div>
<div style="float: left; width: 80%; background: #FFF;">
<div style="padding-left: 5px; border-top: 1px solid #B0B0B0; border-left: 1px solid #B0B0B0;">4022484600466</div>

What matters to us of all this trash is only the attribute name, here "EAN", and its value.

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (132,'re_d_ean','<b>EAN-Code:</b></div>\s*</div>\s*<div[^>]*>\s*<div[^>]*>(.*?)</div>');

It is not trivial to decide what is trash and what is important. But surely with many examples analysed by hand, we get a feeling of it, and can develop tools to automatize the task of creating the right regexps.

<tr><td align="right" valign="top"><b>Disc Format</b></td><td>
2 x <a href="/en/mxcTmzlzRE38/dvddisc.html?SID=fe7c65d4c892">DVD-9</a><br>

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (134,'re_d_ot','><b>Disc Format</b></td><td>\s*(.*?)<br>');

As in this example, it might be necessary to filter the result through BeautifulSoup.

<strong>Erschienen bei:</strong> Universum Film, Polyband</li>

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (135,'re_d_producer','<strong>Erschienen bei:</strong>(.*?)</li>');

<b>VÖ:</b></td>
<td>01.01.1997</td>

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (136,'re_d_date','<b>VÖ:</b></td>\s*<td>(.*?)</td>');

<strong>Erschienen bei:</strong> <a href="http://www.booksell.de/shop/stb_mus_startseite/mehr-von/musik/fqml/oetinger.html;jsessionid=fdc-6zdfxprx1c2.tc1" title="Weitere Titel dieser Marke">Oetinger.</a></li>

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (137,'re_d_producer','<strong>Erschienen bei:</strong>\s*(.*?)</li>');

<strong>ISBN-10:</strong> 3-7891-0336-5</li>

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (137,'re_d_isxn','<strong>ISBN-10:</strong>\s*(.*?)</li>');

<tr>
<td class="left">Veröffentlichung</td>
<td class="right">22.11.2001</td>
</tr>

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (138,'re_d_date','<tr>\s*<td class="left">Veröffentlichung</td>\s*<td class="right">(.*?)</td>\s*</tr>');

<li><strong>EAN:</strong> 5050582018073</li>

INSERT INTO infosourcefeatures (id,feature,value)
VALUES (139,'re_d_ean','<li><strong>EAN:</strong>\s*(.*?)</li>');

Functions to deal with infosources

None

Knowledge Tree

Unibas' Knowledge Tree is a tree-like scheme to order information in fields of knowledge, here called nodes and stored in the node table. The lower a node is situated in this tree, the more specific it is.

In general, to understand the content of a node (the current node), you need to have understood the contents of the elder siblings of the node, that is, the nodes on the same level that come before the current node.

Also, you should have understood the ancestor nodes of the current node. But that does not mean you must read the contents of these ancestor nodes completely - that would not be reasonable even if it were possible. The higher a node is, the more general is the understanding you need about it in order to understand the current node.

Every node has exactly one parent (except the "root" node which has no parent). Every statement (piece of text) belongs to exactly one node.

Of course you can think of a node having several parents (e.g. urban planning in Tokyo can be subsumed regionally, under Japan, thematically, under urban planning) and the knowledge can be sorted along different topics, e.g. historical, regional and thematic; but this is not implemented directly in the node table. See the top node, Knowledge Tree, for details about how the topics are ordered, and what is included and what isn't.

Generated documents

From the knowledge stored in quotations and organized in the thesaurus, new documents can be created. Nodes in the thesaurus become sections, text is taken from the quotations. Links in the quotations that point to nodes, are changed to links that point to sections in this or other generated documents. (There are also links to outside documents.)

In fact, a generated document (for convenience, let's call it gendoc after the table who represents it) is not a real document, but a series of documents. As new knowledge accumulates in quotations, new documents are released from time to time.

Tables involved in creating integrated documents:
gendoc: Contains one record per document to create. Different versions of generated documents may get different document records, but the same gendoc record.
section: links gendoc and node tables, one record per section
document: real documents created by releases

Releases

When a gendoc is released, a real document is created.

It is put in the archive, just like a document acquired from outside.

A minor release, possible only if the old version is only in the archive and not on other media like CD-ROM, replaces the old version, gets the same document ID. Minor version number is increased.

A major release gets a new document ID, and is linked to the old version by version relation. The old version is kept for backup purposes. Major version number is increased, minor number set to 0.

Filename is generated from first node title (cleared from special chars), and version added. Example: Section "GNU/Linux", major 2, minor 3, gives: "GNU_Linux_2_3". In case of filename clash, it is changed to be unique.

For a given node, the function

getActualDocument

tells which is the actual (canonical) document this node is in.

function to tell for a given link, which is the most recent document a definitive occurrence of the node is in. (if several, list of them?)

when a quotation is edited and saved, its link info is extracted and stored in the link table. So for a given node, one can find out quickly what links point there.

generator: given intdoc entry, works through sections and generates docbook document from quotations (which are in docbook already). Recognizes (raw) links, looks up and inserts intdoc links. What to do with links to non-intdoc documents? Generate link to "from" in the original document? Maybe bool field in quotation telling whether this back-link should be included in generated document? Accept also HTML tags?

Function to compare two section numbers (order). To do this, we bring the number field in a form where alphabetical order is identical to section order. This is the case if all number parts have equal length. Thus we patch the numbers with leading zeroes to length 3 (or 4, if 999 subsections aren't enough).

Algorithm to create document doc:

Remove tmp sectionsGo through all sections of doc, in order given by number: Create subsections recursively, all or up to depth; number them, give them titles, make them tmp, give them depth of parent - 1.Get some global info and write document headerGo through all sections of doc, in order given by number: Go through all quotations of corresponding node...

Section table

When a document is automatically created from quotations of selected nodes, these nodes become sections of the document. This table is used to store information about these sections temporarily, while the document is created.

In a first run, non-permanent nodes are deleted and (re-)created from subnodes of permanent ones. In a second run, the document is created. This is done in the Document module. See the TAB_AUTO file.

, is node with depth: 0=take only the quotations of this node, 1=of this node and sons, etc.; -1=this and all subsections (no limit); section title (NULL for node name); maybe later definitive (link here for definition) or not; tmp bool: true for sections that are created from subnodes and can be removed.

Specifications for section table

gendocid: ID of gendoc this section belongs to
nodeid: ID of node of the section
ord: order
depth: 0=take only the quotations of this node, 1=of this node and sons, etc.; -1=this and all subsections (no limit);

Gendoc table

name: gives filename; if empty, take node title
  The documentid field points to the canonical version of the latest
  created document (i.e., the Docbook version).
  Other versions can be reached via version relation.
minor versions overwrite

Specifications for gendoc table

documentid: The canonical version of the latest generated document
title: Title, default title of first section
name: Filename, to be completed with version; this must created such that
   it fits the restrictions for filenames
minor: minor version number, increase causes old version to be overwritten
major: with a new major number, new document will be created, 
   old document remains existant

Specifications for node table

idparent: ID of parent nodeord: smallint, order among sons of same parent, counted from 1title: text, mandatory, not unique, represents node, gives section title in generated documentsdescr: text, optional, details topic of nodemark: smallint, ?struct: boolean, true if this node is a structural one.text: text body of this node, in DocBook.public: if true, this node's text is intended to be published.

The order of children is given as ord, a (small) integer counted from 1. Its limit is 32767 or higher, anyway enough for a reasonable number of subsections. The program ensures children are always counted correctly, i.e. if a parent has n children, they are numbered from 1 to n. If a child is inserted, the numbers of children after the new child's position are incremented, and vice versa if a child is removed. There is also a function, recalcAllOrds(), to recalculate all ord values, in case they are in complete disorder. It should not be necessary normally, as ord is always kept up-to-date, but it does not hurt, either, apart from taking time.

The struct field is used e.g. in Software subtree. Structural nodes provide the structure of the knowledge tree (or a subtree, e.g. Software subtree). At the top of the knowledge tree, all nodes are structural. Non-structural nodes and their children provide information about a topic, e.g. a program of the Software tree. The highest non-structural nodes (i.e., closest to the root, with structural parent) are typical candidates for generating documents.

thes-software tree: struct. nodes and candidate nodes are copied. "Common features of..." nodes are copied too, because they are not structural, and that is bad but not critical. When copying, they can be filtered out by name. software tree-node: How to filter out subdirectories of a program? If the program is known in the thesaurus (candidate node), subdirs won't be copied. If not, they'll be copied mistakenly. in this case, a new program or even subtree had been added, and it must be edited in the thesaurus anyway. software tree-Windows: First copy all subdirs. Create links for exes and bats. Then delete empty subdirs from bottom-up.

The target format in the knowledge base, i.e. the format for node.text, is always DocBook. However, you may use several methods to enter, edit or generate DocBook.

The simplest method is to enter it directly. For example, to enter a paragraph break, you can just type:

</para><para>

Of course this is not very convenient, so the context menu offers Add DocBook, which lets you choose from the most frequently used DocBook tags. The first entry in the DocBook menu is the para entry, which inserts an ending </para>, then an empty line, then a beginning <para> for the next paragraph. Note that a node.text needs no <para> at the very beginning and no </para> at the very end.

The Add DocBook function offers many tags and constructs. It does not just enter them for you, it also frees you from having to know them by heart, as it briefly explains them. It can also create lists and (TODO) tables.

Another possibility to enter DocBook is to enter (or copy-and-paste) HTML and convert that to DocBook. The editor has a (TODO) rich text mode which produces HTML in a WYSIWYG manner. It is even possible to round-trip: edit HTML as rich text, convert and store as DocBook, then convert that to HTML to edit it in a WYSIWYG manner, and so on.

However, this is not recommended as you can't enter the DocBook-specific tags this way, and if you do have DocBook-specific tags, you will lose them this way. When you convert DocBook to HTML, you lose DocBook-specific information. Convert it back from HTML to DocBook, and it will not be the same as before, although it looks similar in the output.

The perfect WYSIWYG DocBook editor has not yet been written (at least not free to embed it in Unibas).

Another possibility is to take over plain ASCII text. Sometimes you have well-written plaintext (in ASCII or, better, UTF-8) such as READMEs or program documentation, or the output of a commandline tool's help option. You can surround this by <literallayout class="monospaced"><![CDATA[ and ] ]></literallayout> (TODO: or better let Unibas do this for you) and it will be presented literally in the output.

If such plaintext follows the rules of a lightweight markup language (http://en.wikipedia.org/wiki/Lightweight_markup_language) such as reStructuredText, it could be transformed to DocBook (maybe via HTML) and the surrounding literallayout tags removed. Unibas offers this for reStructuredText (TODO, and maybe others).

Nodes marked as public are intended to be published. Nodes marked as private are intended to be kept private (temporarily, at least). Nodes where public is NULL inherit their public value from the parent node.

A problem arises if there is a public node under a (explicitly or by inheritance) private one. It such a node were to be exported and not its parent, the node would "hang in the air". This problem should be reported to the user. Possible solutions:

Make the node private (including its descendants).
Make only the path there public.
Leave as it is, which means the node title is published, but not its text.

The first solution would mean that it simply was a mistake to mark the node as public. The user gets aware of this by the warning, and corrects the mistake.

The second would mean that only some details of the parent(s) are private, but the node is not. The private details are or will be moved to other children of the (up to now private) ancestors. If the users chooses this, the chain of private ancestors up to the first public ancestor will be marked as public, their other children retain their old status, which means that if they were implicitly private up to now, they are now marked explicitly private (because the parent is no public). Of course, the user should carefully review the texts of these parents turned private, if he really wants to publish it, and maybe move some text to private subnodes.

The third (and default) possibility requires no action, but has implications at publishing time. Of private nodes, the text is not published but if necessary (because of a public descendant), the title might be published. If a public node below a private one occurs, the user is clearly warned that, if he takes no other action, the titles of private ancestors will be published. If the titles themselves contain information to be kept private, the user should change them. That is his responsibility after this warning.

Quotation table

Obsolete. A node now has exactly one quotation, node.text. Skip the rest of this subsection. It will eventually be deleted or adapted.

Quotations are text pieces about a subject, normally taken from an existing document, but sometimes without a document source, "taken from life". A quotation usually belongs to a subject, i.e. a thesaurus node, to which it is linked by its Node ID field, as e.g. a quotation about C functions belongs to C programming or a subnode thereof. Adapted quotations from specialized documents can be reorganized by a branch of the thesaurus and taken as raw material for new specialized documents created automatically. In this case usually the Text field is filled with the adapted text.

Another use of this table is to find textpieces about a certain subject, represented as a notion in the Notion table, which is linked to Quotation by the Quotation about Notion table. Thus an universal index can be created which from a notion (or even more generally, from a word meaning that notion) via the From entry in Quotation directly to a text piece in its document. If this is in a (paper) book, the user is given its title and page number, but if it is a scanned or otherwise computer-readable document, the user can view it directly on screen.

These two usages don't exclude each other.

"From" can be of one of the following forms:
"p" directly followed by a page number, e.g. "p99" for page 99, that can be given in roman numerals, too
"#" and an HTML anchor name
"~" and a path with filename eventually followed by "#" and an anchor name (if the document is a series of HTML files)
"c" and a (long) integer that indicates the position of the first character to be read. This is useful not only for plain text files but also for HTML files.
The character indicated by the integer is to be found counting newlines as one character. Thus, if they really are one character,

seek

may be used. In HTML, all characters are counted, including tags. Entities like "ä" that represent one character line "ä" are counted with their full length, 6 in this case. If the character indicated is inside a tag, this is an error that may be detected by parsing the text from the beginning using the HTML Parser.

These specifications are important for all functions and procedures that deal with quotations, e.g. GetStringFromDocument. (as an extension) the letter "l" followed by a pagenumber, a colon (":") and a line number to indicate a line in a page the letter "s" followed by a section indication. This is useful for documents like those of the Linux Documentation Project, which come in different formats (e.g. ASCII text, TeX DVI, Postscript) that may differ in page numbers but where the section numbers are identical. For example, the Linux Filesystem Structure, Release 1.2, by Daniel Quinlan has a section 6.4: Symbolic Links that is in the text version on page 37, in the DVI version on page 35 and in the Postscript version on page 34. However, their contents are the same. In this case the preferred format is

s6.4

. .. and perhaps other extensions, that all have their type determined by the first letter

In books, the page number can indicate a page in the paper version (to be looked up by hand), in the scanned image series version (to be converted to a file name and looked up automatically), and in the OCR'd version, that can be stored as an HTML file with the following properties:

The page number of the original is indicated at the beginning of the text of that page as follows:

     <A NAME="page1234"></A>

The page number (here 1234) directly follows the "page", without leading 0's or blanks. Lowercase roman numerals are also possible:

     <A NAME="pagexii"></A>

This way, the same page reference information can be used to access either of the three types of documents, whichever is best accessible. The results of research and development projects are reports, i.e. documents, and these documents are created from quotations collected during the word at tese projects. Quotations are not only useful to store the results after the project, but also during the project. A quotation can be marked "ToDo" if this part needs further work; a mark "question" reminds that this is a question that naturally arises at this point, i.e. the question can remain but it must be answered.

A "temporary" mark tells that about this topic, at the time of writing there can ne made no final statement, it must be revised later.

A quotation can also be marked as Internet link. In this case, it can be automatically checked whether the target is still there. (It might be that the target is there but has changed its topic; this case is much more difficult to detect automatically.)

Specifications for quotation table

Proposal: Create a field named "text" in node. This field will take the place of quotations. Concatenate all quotations about a node, ordered by quotation.ord, and put the result in the new node.text field. Whether the other quotation fields should also be included in node, remains to be seen. Most of them have proven unnecessary. This way, every text gets a title. Before, the second and following quotations had no title of their own. A notion, if there is text about it, points to a node and not to a quotation. Now, if text deserves a quotation of its own, it also deserves a title and hence a node. That is good. If there is a change of topic, there is a new node. Of course, note texts can still be arbitrarily long. But normally, at least in scientific texts, a node.text has only a few pages. Is there a better name for thesaurus than "thesaurus"? "Knowledge tree"? It must support citations. That is, it should be easy to enter citations, ideally the user has a file open in Okular or some other document viewer, and tells unibas to take the current position in Okular as citation in the current node. Also, when the contents of a document are absorbed in Unibas, citations should be built in at the same time. Citations can follow the Unibas conventions.

Quotation language

The quotation language is usually English.

As an extension, there might be two text fields, one in a "native" language and one in English, and a field telling what the native language is, and a binary field that tells whether the English or the native text field is the master. When editing the quotation, the author edits the master field, and a translator edits the other field (optional if the master is English).

Why English? In short, because English is the de facto lingua franca of the world. Should the world agree some time to a different lingua franca, and if that new lingua franca is really accepted everywhere and forever, then Unibas will switch to that new lingua franca. But that's an unlikely case and till then, Unibas assumes English to be the current and future lingua franca.

For a longer answer, we quote the Norwegian author of a web site about Tolkien's invented languages:

The documents on this web page are written in the Common Tongue of this age, Anglian or in its own term English. Anglian is a not-so-ancient tongue of Britannia beyond the Sea, definitely not the first to be recorded in writing, but once it was recorded, its users never took courage and revised the spelling - no matter how irregular and ridiculous it became as the centuries passed by and numerous sound-changes caused great upheavals in the phonology. Later, the Anglians decided that the rest of the world was just dying to be ruled by them and did their best to comply. Thus their tongue was spread to many lands and continents. Unfortunately, the colonies eventually proved ungrateful and rejected the beneficient, civilizing rule of the Anglians. An early and prominent case was the great (very great, actually) isle of America, but later a number of other states followed suit, and the Empire crumbled. Nonetheless, the Anglian tongue had become very widespread. Moreover, the great (still very great) isle of America rose to a position of immense political power and cultural dominance, flooding the world with movies, soap operas and songs in the Anglian tongue (the songs, at least, could not be dubbed). Though others often found this tongue difficult to pronounce, not just because the spelling only hinted how the words were pronounced but also because the language was full of blurred vowels and weird spirants and sibilants, it at least had a fairly simple grammar. In particular, the language had disposed of cases and different genders of the noun. So after all, it was about as good a lingua franca as one could realistically hope for. In any case, there was no real alternative, to the great regret of the Esperantists and the French.

This, then, was the tongue said writer - himself a Norwegian - had to use when he prepared the material for his site, having a world-wide audience in mind. In some cases, he observed that the British and the American Loremasters do not agree on certain points when it comes to the representation of Anglian in writing. In such cases he felt perfectly free to make his own choice.

So do the authors of Unibas. Americans might choose the American system, the British might choose the British system, and others choose what they want, which may be a mixture of the two, as long as it is readable and understandable to the readers.

The same holds for the other languages, where they are used. You may write "quatre-vingt-dix" or "nonante" in French, "Januar" or "Jänner" in German, use new or old spelling in German, traditional or modern characters in Chinese, as long as the audience can understand it.

Directory tree and bookmark import-export

Software subtree

The Software subtree of the thesaurus discusses computer software of all sorts. In a system where software is arranged hierarchically (in contrast to, e.g., putting all binaries into few /bin, /usr/bin or similar directories), each node in this subtree corresponds to a directory of a real software tree, and under this the executables reside.

Windows is an example of such a system where every program usually gets its own directory below the "c:Programs" directory. There is a utility called "cress" that generates a directory tree below "c:WindowsStart menu" with links to executables (.exe, .com) on a Samba server. Later (today is 2002-12-26), this functionality should be integrated in Unibas (in C++), but as for now, there are only the following scripts "cress" and "cress2" (needed by cress):

#!/bin/bash
# CREates script that creates Software Structure, needs cress2
# Parameters (paths absolute!):
# dir where software is, as seen from Windows
winsoftdir=${1:-/cygdrive/w/Software}
# dir where links are, as seen from Windows
linkdir=${2:-/cygdrive/c/Windows/StartmenÃ¼/Software}
# dir where software is, as seen from Linux
sambadir=${3:-/dose/Software}
# path of generated script
scriptpath=${4:-/tmp/genlinks.bat}

echo -e "#!/bin/bash
mkdir $linkdir
cd $linkdir" >$scriptpath

find "$sambadir" ( -iname "*.exe" -o -iname "*.com" ) 
-exec cress2 "$winsoftdir" "$linkdir" "$sambadir" "$scriptpath" "{}" ;

Now this is cress2:

#!/bin/bash

# Creates links for .exe and .bat files
winsoftdir="$1"; linkdir="$2"; sambadir="$3"; scriptpath="$4"
foundfile="$5"

   len=`echo "$sambadir" | wc -c`
   if [ $len -eq 0 ]; then exit; fi
   relpath=`echo "$foundfile" | cut -c$len-`
   if [ -e "$linkdir$relpath" ]; then exit; fi

   dir=`dirname "$relpath" | cut -c2-`

   echo "mkdir -p "$dir"" >>$scriptpath
   echo "ln -s "$winsoftdir$relpath" "$linkdir$relpath"" >>$scriptpath

Cress requires cygwin to be installed under Windows. From the Samba software tree, it generates a script that, when run under cygwin with "bash genlinks.bat", generates the links to the executables in this software tree. Instead of selecting programs from the "Programs" menu, the user selects them from the "Software" menu where he finds the created links. (The "Programs" menu still exists aside of the Software menu.)

Cress and cress2 belong to Unibas as long as there is no solution integrated in Unibas. To use cress:

create Software subdirectories with programs, consistent with thesaurus Software subtreerun cresstransfer generated script (e.g. genlinks.bat) to Windows (for example, copy it to the share tree exported by Samba)run script under cygwin like this: "bash genlinks.bat"

Dictionary: Word, Notion and Meaning

The Unibas dictionary is especially suited for scientific and technical terms that have clearly defined meanings. Words and notions (the meanings) are separated; the notions are language-independent. This idea has already been used for WordNet, but so far only for one language: English. UniBas extends this to other languages, too: The English word "tree" is linked to the notion of a tree, the French word "arbre" is linked to the same notion, and what you get is the translation of "tree" to "arbre".

One goal of the UniBas dictionary is to be able to search inflected forms as well. For example, the user is reading a German text which contains the word "gelacht", and he wants to look up its meaning. If he copies and pastes the word into an ordinary on-line dictionary, most likely it won't be found, because it's an inflected form. Of course, the user can guess from the German grammar rules that the infinitive is "lachen", but why not make the computer do the work of guessing the infinitive? Taking "ge-" and changing "-t" into "-en" is a simple task for it. There are more complicated cases, but all are well feasible.

Words and notions only constitute a dictionary, not an encyclopaedia. The definitions of notions are short, only suited to identify things to humans, not to explain things. But notions can be linked to nodes in the thesaurus tree.

Like the rest of the database, the dictionary tries to be universal. It doesn't focus on one application. Most dictionaries are intended for human users only; you can look up a word and its meaning or translation is displayed. You can do this with the UniBas dictionary too, but also much more.

The dictionary is organized data-centered, not application-centered. On the one hand you have the notions, i.e. what the words mean. When you say "tree", you have a real tree in mind with its stem and its branches and its leaves. When you say "house", you see in your imagination a building with walls, a roof, doors and windows. The representations of these notions, these images you see in your imagination, are the objects of the notion entity, stored in the notion table. They are independent of the words used in the different human languages to denote them. These words are stored in a separate table, the word table.

Note that in WordNet, UniBas' words are called "word forms" while notions are called "word meanings". Of course words and meanings are related. Words denote notions, mean notions, and these meanings are stored in the meaning table.

That's the topic of this section: the word, notion and meaning tables, what to store in them, and what they can be used for.

You can see this dictionary as a center of computer's knowledge Also, it allows application programs to be internationalized automatically. The dictionary is special in the database in that almost everything is connected to it: Persons, famous documents, etc. Maybe it's the place where to begin with real computer knowledge. Notions could be linked to classes in object-oriented computer languages. To begin with, the other tables (entities) are linked with the dictionary. Dictionary entries about persons are formatted in a way the computer knows how to extract its information to store it in the fields of the person table. And person table entries can be re-formatted for use as dictionary entries. The same holds for document etc. entries.

This raises the question: Retain both, dictionary and e.g. person entry? I'd say yes, as long as the computer knows about it. Not all persons might be interesting for dictionary, but when searching in dictionary, the users should have the possibility to extend the search to persons of person table, who are then output in dictionary-like fashion. Same holds for documents.

Dictionaries in a relational database

Dictionaries are shifting from paper to electronics. Yet in most cases, the electronical ones are rather simple copies of their paper originals. But with a little more effort, they can be improved. Not only the computer's storage capacity, but also its computing power can be made use of.

Simple data structure and simple applications to start with In the case of bilingual dictionaries (English-Foreign, Foreign-English), most of the information is recorded twice. If you look up "apple" in an English-French dictionary, you find it translated to "pomme"; if you look up "pomme" in a French-English dictionary, you find it translated to "apple", as you would have expected. Thus in the traditional way, if you deal with several foreign languages, you have to provide a dictionary for each language pair (if a French learns Spanish, it makes little sense to him to use an English<->French and then an English<->Spanish dictionary pair).

Universal through an intermediate representation With the computer, however, the idea of an universal, all-languages dictionary seems not too utopic. If our French student wants to know the meaning of a Spanish word in French, he types in that word and the computer looks up the meaning, in some internal representation. Then it looks up the word(s) for this meaning in the target language, Spanish in our example.

What remains to be discussed is that "internal representation". It's not "pomme", it's not "apple", it's not "manzana", it's not "Apfel", what is it then? That's not important (most likely it will be a number). What matters is what it stands for: that number (or whatever it is), taken as the internal representation of a translated word, must always represent what in English we tell "apple".

For example, in a dictionary that does not translate from one language to another, but explains the meaning of a word in the same language with simpler words (an encyclopaedia), under that number which represents the idea (notion) of "apple", you should find a text like: "The firm, edible, usually rounded fruit of an Eurasian tree (Malus pumila) having alternate simple leaves and white or pink flowers". ("Apple" can also mean other things, e.g. the tree itself. In this case the tree and the fruit are given different numbers, although denoted by the same word.)

What to store in the dictionary

Dictionaries provide different sorts of information about a listed word:
syntactic: what part of speech it is, how it is used to build a grammatically correct sentence
semantic: what it means, i.e. what notion it is linked to
speech level: whether it may be used in a conversation with the Queen, or rather only with your friends
context, field of knowledge: some words occur only in special fields of knowledge; the word "algorithm" for example is hardly used outside of computer science or mathematics
compounds: usually the meaning of a compound can be guessed from the meanings of the part words by some rules. But for many compounds this is not the case. These have to be listed in a compound dictionary, and from every important part there should be a link to the compound (and vice versa?)
pronunciation in an international phonetic alphabet
For the different sorts of information, different tables are used in the relational database.

There is a notion table which contains notions and their explanations (only in English, which I am using as an universal language). Notions are not simply words. For example, "appendix" means different things in medicine and in literature (appendix of a book). So here are different notions, but only one word. On the other hand, different words mean the same, but in different politeness or speech levels. For the notion "to eat", in Japanese there are the words "taberu" and "meshiagaru". So a notion is clearly defined and neutral, but a word can have different denotations and connotations.

Notion table

The notion table represents these notions or ideas, with the fields Notion ID, Notion and Explanation. Notion ID is just an ID (a unique number), Notion is the canonical version of the word, which may differ slightly from the word in the word table (e.g. "nigger" and "negro" are both words for the notion "negro") and Explanation is the explanation, e.g. "a person with dark skin who comes from Africa or whose ancestors came from Africa".

Both fields are given in English. This looks like I am biased: why in English and not in any other language? First, I do regard English as an international language (a lingua franca, not just one of over hundred others, but the one generally used on international meetings etc., and that's why I am writing mostly in English (although it's not my mother tongue).

Secondly, this left the possibility of a more universal Notion table with an additional field for the language (e.g. de for German, fr for French, jp for Japanese...). The ID in this table would no longer be unique, but in combination with the Language field, it would be.

Another table, word, is used to store the words (of any language), and a link table (meaning) to define the relations between Words and Notions. An entry in the Word meaning table can be read "the word Word ID has the meaning Notion ID", or more generally "there are cases where the word Word ID means Notion ID", but also the other way round: "the notion Notion ID can be expressed by Word ID". The meaning table also takes the information that a word in a given use is slang, offensive, polite, archaic, etc.

Notions are not simply words. For example, "appendix" means different things in medicine and in literature (appendix of a book). So here are different notions, but only one word. On the other hand, different words mean the same, but in different politeness or speech levels. For the notion "to eat", in Japanese there are the words "taberu" and "meshiagaru". So a notion is clearly defined and neutral, but a word can have different denotations and connotations. Notions often clearly belong to a special field of knowledge. The notion "prime number" belongs to mathematics, "gland" to medicine etc. The corresponding field of knowledge can be represented as a thesaurus node.

What remains to be examined is whether a notion always belongs to only one node. Many notions, like "vector" seem to belong to more, in this case to mathematics and physics.

Another problem is which node to take. The notion "prime number" belongs not just to mathematics, but to number theory. In the thesaurus, this can be indicated very accurately, much more accurately than this is usually done in dictionaries. Yet, the user might not want to know it that exactly. Average users might not even know which general field of knowledge a subfield belongs to. To them, indicating "mathematics" instead of "number theory" is more useful. However, this means a loss of information. A solution might be to internally indicate the exact subfield, but when output in the style of a traditional dictionary is needed, go up in the theraurus hierarchy until a node is reached that is used as knowledge field in traditional dictionaries, like "mathematics" in case of the subnode "number theory". This means marking special thesaurus nodes as dictionary knowledge fields.

See Thesaurus. Indicating a node ID for a notion allows to create a universal index covering an unlimited number of documents. The explanation should only make clear what is meant by this notion, and not be taken as a complete description of the notion and its context. This is the topic of notions, their quotations and documents generated from their contents.

The explanation should only make clear what is meant by this notion, and not be taken as a complete description of the notion and its context. This is the topic of specialized documents.

If a notion clearly belongs to a subject represented by a thesaurus node, then a link in the Thesaurus-Notion link table is useful to indicate this.

Specifications for notion table

In general, what can be derived using grammatical rules need not (and should not) be in the dictionary. Dictionary entries (word, notion) should be consistent.

Explanations in the notion table should explain a notion in simpler terms, not in terms of the same level; e.g. October is NOT "the month following September and preceding November", but the tenth month of the Gregorian calendar.

Numbers that are built regularly, e.g. "twenty-one", should not be in the dictionary. Numbers that are not built regularly, like "eleven", must be in the dictionary. notion.notion is the canonical, unequivocal term for that notion; may be identical with word.word, but may also be more precise; states exactly what is meant, for people who know it already. notion.explanation is for people who don't know this notion yet; with this explanation they should be able to figure out what is meant.

Cardinal numbers: notion.notion='cardinal number 11'. This should be enough for people who know what a cardinal number is (at least in their mother language), and people without this basic knowledge are not the target of Unibas. So notion.explanation can be empty in this case.

notion.explanation is not only for humans. In some cases it is also used by the program and must therefore follow some rules. It is both human- and machine-readable and therefore care must be taken in its wording. For example, the months of the Gregorian calendar are explained as "[the] nth month of the Gregorian calendar", with "nth" replaced by the respective ordinal number. So March is "the third month of the Gregorian calendar", and notion.notion for March is "the month of March" instead of just "March" because it could be confused with march, the way of walking.

The explanation should not be too long, one sentence is enough. The first sentence of a Wikipedia article usually makes for a good notion explanation. More detailed information should go in the quotations of the thesaurus. To stay with the Wikipedia example: The first paragraph of a Wikipedia article usually makes for a good quotation about a notion. The notion.quotationid should then point to that quotation. (This is more accurate than a notion.nodeid, which was also considered. The nodeid can still be obtained from quotation.nodeid.)

Word table

The word table is used to store the words (of any language), and the meaning table to define the relations between Words and Notions.

The word table contains all information specific to a word: ID, word, language, part of speech, pronunciation (in Japanese there is a problem: can we take the romaji reading as pronunciation?). A word can mean different things, often related to one another. For example, the word "horse" as a noun can mean:

the domesticated horse (Equus caballus) having a short-haired coat, a long mane, and a long tail, used for riding and for drawing or carrying loads; an adult male horse; a stallion; any of various equine mammals, such as the wild Asian species Equus przewalskii or certain extinct forms related ancestrally to the modern horse.a frame or device, usually with four legs, used for supporting or holding (the name probably comes from the animal horse);in sports, a piece of gymnastic equipment with an upholstered body used especially for vaulting;in slang, heroin;(usually plural:) horsepower (hp);in military, mounted soldiers; cavalry: a squadron of horse;in geology, a block of rock interrupting a vein and containing no minerals, or a large block of displaced rock that is caught along a fault

From the notion dictionary resp. from its thesaurus link one can tell that the piece of gymnastic equipment belongs to sports and the mounted soldiers belong to military, and so on. Yet that the use of "horse" for heroin is slang is an information which belongs neither to the word nor to the notion, but to the usage of this word in this meaning, and thus to the link between word and notion.

The definitions for one word can also be completely different, as in the case of "hide":

to put or keep out of sight;the skin of an animal, especially the thick, tough skin or pelt of a large animal;an old English measure of land, usually the amount held adequate for one free family and its dependents

As a working hypothesis, I regard two words as identical when they share the same spelling, pronunciation and grammar. They do not need to share the same meaning.

In this sense, the German word "erschrecken" gets two different IDs. There is a transitive version meaning "to startle" and an intransitive one meaning "to be startled". What causes them to need different IDs is not their different meanings but the fact that the intransitive one is irregular. Its past tense is "erschrak" instead of the regular "erschreckte".

Thus, the English resp. American words "theatre" and "theater" are not the same. If needed, their relation (regionally different spellings of mainly the same word) can be expressed by another relation. In purely relational databases, one must not hesitate to create many tables, if the object is complicated. For every aspect of an object that can't be expressed with a simple data element, generally a new table has to be created.

The Word table needs an additional field named "Grammar" that contains grammatical information. Its content depends on the language and the part of speech (PoS field). In the example of a German noun, it contains the gender and, if it is irregular, information about declination.

every word which cannot be reached via rules, must be listed; if it can be reached, it must be listed only if it provides a new meaning, e.g. "kashikomarimashita" from "kashikoi" names (place, given, male/female) must be marked as such

part of speech must be indicated

proverbs, idiomatic expressions are no data for the word table?
they are best dealt with in documents with links

see also module Dictio in d:datadata.mdb

Word table: words only?

if the meaning of a combination can be derived from
the meanings of its parts, then the combination needs not
be stored in the DB. It adds no information.

if the meaning of the combination is different from
the combined meaning of its parts, then the combination must
be stored in the DB.

For example, the meaning of 
make (something) safe
can be derived directily from the meanings if make and safe.
On the other hand,
make a go of sth.
is a special expression for "make a success out of sth".
that can not be directly derived from make and go.

There are many big dictionaries that are only big because
they include such redundant information.
One example is http://www.mrhoney.de/ with its 
endless word list like http://www.mrhoney.de/031145.txt
Example:
make up a small part of the total economy
make up a substantial part of the costs
make up a tiny group
make up a very important target group
make up a very large part

The Unibas needs to build upon external dictionaries,
because the mass of data cannot be handled otherwise.
But it has to avoid such repetitions and redundancies.

Fields: id word lan: e.g. 'en' for English gram: grammar info pron: pronunciation via IPA, e.g "/sɜː(r)tʃ/" for "search" For some languages, IPA is not practical, e.g. in Japanese, Hiragana is used instead, as in 犬 with pronunciation いぬ ("inu").

Compound table: how to treat compounds?

dictionary must be computer-usable, e.g. for logic programming, but also for human usage: word trainer, reading foreign-language texts and displaying the translations of the words, ... words and compounds treated the same? in the same dictionary? or by SQL UNION?

Rule base: to get the plural, append -s ...

Etymology

Etymological derivation is indicated by the notion 14322: derive. I.e. a word derives from another word, such as the name "Gunther" derives from "gund" and "heri".

Different forms

A different form of a word is linked to its original by the relation represented by the notion 63805: form. The right (2) side is the original. Example: "Gunnar" is a form of "Gunther".

Specifications for word table

word (text, not null): headword in infinitive form, in Unicode, native spelling (i.e. not transcribed with Latin alphabet)lan (text, not null): language code, e.g. "en" for English, "de" for German, "jp" for Japanese, "ahd" for Althochdeutsch (Old German)gram (text): grammatical information; several parts separated by commas; the first part is the PoS (part-of-speech, n for noun, v for verb, adj for adjective, gn for given name, fn for family name); the meaning of following parts depend on PoS; for given names, second part of gram is "m" or "f"pron (text, optional): pronunciation, in Unicode IPO

German grammar

gram: 1st field PoS:
n noun
prp preposition
adj adjective
adv Adverb
interj interjection
v verb
tr transitive
itr intransitive
pred predicative Ergaenzung (fr attribut)
pref prefix
pron pronoun
prov proverb
subj subjunctive (Konjunktiv)


noun:
2nd field: gender: m,f,n or pl for plural only
3rd field: genitive singular; 
4th field: nominative plural; 
-e indicates: append e
ß-sses indicates: cut last character ß and append "sses"
":" means: change last vowel into Umlaut, e.g.
ß-:sser for Schloß translates into Schlösser

| indicates several possibilities, e.g.
"-es|-s" for genitive of "Berg", yields "Berges|Bergs".
"-" alone means unchanged (plural=singular);
"0" (zero) means this form doesn't exist (e.g. no plural)

Meaning table

An entry in the meaning table can be read "the word Word ID has the meaning Notion ID", or more generally "there are cases where the word Word ID means Notion ID", but also the other way round: "the notion Notion ID can be expressed by Word ID". The meaning table also takes the information that a word in a given use is slang, offensive, polite, archaic, etc. This table should be read as follows:

One of the meanings of [Word ID] is [Notion ID]

or vice versa:

One word that expresses [Notion ID] is [Word ID]

How to indicate the subject a word belongs to

Words often have different meanings in different subjects, e.g. medicine, geography etc. In traditional dictionaries, this is often indicated by abbreviations like math, med etc. Now these subjects closely resemble out thesaurus nodes. However, the thesaurus is far more detailed than the few subjects a dictionary usually knows.

The index of a specialized book is a good candidate for a subset of a dictionary. Anyhow, a dictionary only defines words and doesn't explain them in detail. This is the task of specialized books. The dictionary can be used as an index of the whole integrated documentation.

Thus, even the special words from the index of a specialized book should be included in the dictionary. The subject is indicated using the thesaurus node, thus much more precise than in a traditional dictionary. So, in the Unibas dictionary there is more information than in a traditional dictionary. For users who want a behavior that closely resembles that of a traditional dictionary, the extra information can be filtered out. Take the word entry "char array", subject "The C programming language". For a "normal" user, subject "Computer science" would be enough. So some nodes like "Medicine", "Mathematics", "Computer science" can be marked as candidates for subject output in dictionaries. From the node found in the dictionary, in our example "The C programming language", the program can search upwards until it finds one marked that way, "Computer science" in this case, and output this. Moreover, from the depth of the entry, or from some other indicator, the program could guess that this is a very special notion, and decide to not outputting it at all for the normal user. The possibility, however, to display the full entry must be given at the user's request. What's the best criterion for the specialty of an entry, is still a question. E.g. whether it appears in a normal dictionary.

Specifications for meaning table

Word usage example

This table gives examples of the usage of a word in a certain meaning. Rather than providing the text itself, it refers to a quotation, from where the source can be located.

This seems to be an extension of the Word meaning table, but in contrast to that it does not try to b e complete and several entries with the same Word ID - Notion ID pair are possible. See the TAB_AUTO file.

Compound expressions

What was not so clear at this point was how to represent compound (idiomatic) expressions, patterns, and so on. It seemed clear that simply representing them as character strings like simple words would not suffice in the long run. As expressions are composed of simple words, they should be able to be written down just by giving the IDs of these words and some rules. This would save space and from every (main) word there should be a link to the compound, and vice versa. What is more, the grammatical use of a compound usually cannot be described as easily as that of simple words. But these problems could be postponed and temporarily expressions treated like simple words.What was not so clear at this point was how to represent compound (idiomatic) expressions, patterns, and so on. It seemed clear that simply representing them as character strings like simple words would not suffice in the long run. As expressions are composed of simple words, they should be able to be written down just by giving the IDs of these words and some rules. This would save space and from every (main) word there should be a link to the compound, and vice versa. What is more, the grammatical use of a compound usually cannot be described as easily as that of simple words. But these problems could be postponed and temporarily expressions treated like simple words.

Personal vocabulary

This table tells the computer which words (of foreign languages) a specific user, represented by his Person ID, should know and how well he knows them. This table is used by the Vocabulary trainer form.

See the TAB_AUTO file.

Applications

At least this construction allowed for a simple, but efficient dictionary: Select the source word and language, and the target language, and the computer outputs the word in the target language.

It also allowed for a personal vocabulary trainer: For every person and every word that person wants to learn, the computer stores the skill level (ranging from 0 for unknown at the beginning to 1 for perfect), selects a word (frequently those the user knows badly, and rarely those the user knows well) and and asks the user whether he knows it. (Alternatively, the user could type in the answer, and the computer tells whether it is correct or not. But this takes much time, provoques typing errors to be treated as wrong answer, and generally a user who is honest to himself does better by just selecting "known" or "unknown".) If the learner knows it, the knowledge entry value is augmented, otherwise lowered.

These two applications (I used the vocabulary trainer myself to improve my Japanese) convinced me that my ideas were good and that I should go on and refine them.

Text analysis

This table can also be used to facilitate the learner's reading of foreign-language texts. Before he reads it, the computer goes through the text and looks up and displays all words the reader has little or no knowledge of.

as an aid to write foreign-language docs (foreign refers to your own mother tongue): semi-automatically go through document and replace all words which are not in the target language but in your mother tongue, by their translations. Example:

He ging to school where he learned something about Erdkunde.

would be translated into:

He went to school where he learned something about geography.

An atomic one-language dictionary lists the word bases, gives them numbers a compound one-language dictionary lists compounds, built by applying rules on base words given as numbers. A synonym dictionary defines synonym relations, i.e. between different words denoting the same thing. An explanation dictionary lists a term and its explanation (in English).

Traditional two-language dictionaries

The dictionary information thus stored in a database can also be used to create a "traditional" two-language dictionary e.g. in HTML format; combined with Notion and Thesaurus information, specialized dictionaries and glossaries become possible, that may contain links to specialized documents.

Acquiring words from different sources in different formats

KDE translation files

Vocabulary information is available from many public-domain dictionaries; what is badly needed is a target format and programs to put this information from various sources in that format.

The LEO dictionary

Outlook

new quot.

Countries and languages

The country table lists countries with their usual attributes: name, main language, telephone code, ISO 3166-1-alpha-2 code (e.g. "GB" for United Kingdom) capital, The ISO 3166 codes are taken from http://www.davros.org/misc/iso3166.html. According to http://vollfix.uni-muenster.de/vifanie/NedGuideDt/help/countrycode.html, the 2-letter code is also used (in lowercase) to describe the top level domains of HTML documents.

Telephone codes: http://www.kropla.com/dialcode.htm

Countries are important to denote where products (e.g. films, DVDs) are published/sold.

Languages are listed in the ISO 639 standard. the original language of a film, syncronized languages and subtitle languages

Universal binary relationship table

The purpose of the Unibas thesaurus, its nodes and notions is not only to structure human knowledge, but also to make it machine-readable. Unibas "thinks" in terms of these notions. Unibas has a relation called rel to store binary relationships in terms of notions.

Example: How do you deal with the information that a person is author of a document? The rel table stores exactly information of this type. A binary relation has two sides, here numbered 1 and 2. The example's entry in rel has a person on side 1, and a document (book) on side 2. Rel stores not only relations between persons and documents, but any two entity instances, for example also between a company (publisher) and a book. Hence we must, on every side, also tell the entity. So every side consists of an entity ID (ent1 resp. ent2) and an instance ID (id1 resp. id2, e.g. the ID of the person, company, or document). Of course we also need the type of relation, given in the relid field as a notion ID. And finally there is an ord field to order similar relations.

This is not very human-readable (and is not intended to be so). The computer can translate a rel entry into a human-readable form. For example, the query:

select toString(e1.name,id1,true,true,false,0,'1',''), n.id, n.notion, toString(e2.name,id2,true,true,false,0,'1','') from rel, entity e1, entity e2, notion n where rel.ent1=e1.id and rel.ent2=e2.id and n.id=relid;

produces, with some additional reasonable constraints, the following results:

company 1094: Karl-May-Verlag Bamberg     |  6 | publisher | document 14456: Winnetou II
 person 2860: May, Karl                    |  1 | author    | document 14456: Winnetou II

which tell us that Karl May was the author and Karl-May-Verlag Bamberg the publisher of a document (book) called "Winnetou II".

How is a notion to be interpreted? In the case of nouns, it is simple:

Side 1 "is ... of" side 2.

Replace the dots by the notion and you get an almost perfect English sentence:

select toString(e1.name,id1,false,false,false,0,'1','') || ' is ' || n.notion || ' of ' || toString(e2.name,id2,false,false,false,0,'1','') || '.' from rel, entity e1, entity e2, notion n where rel.ent1=e1.id and rel.ent2=e2.id and n.id=relid and id2=14456;

yields:

Karl-May-Verlag Bamberg is publisher of Winnetou II. May, Karl is author of Winnetou II.

In general, it is not that simple, but this is the idea.

To translate the rel information into English, in the general case, we must look at the words, not only the notion they express. Remember that relid is the id of a notion. This notion can be expressed by several words. We need a standard word (not slang or regionalism; this constrains the meaning record) in English (word.lan = 'en'). If the DB is consistent, there must be such a word. How to build an English sentence with it depends on its word class: For nouns, we write: Side 1 "is ... of" side 2 (replace the dots by the noun). For verbs, we write a sentence with side 1 as subject and side 2 as object of the given verb. We might want to also consider time and number (singular/plural), but these are details. For example, a reasonable relation expressed by a verb is "to digitize", in the sense that a person digitized a document. For adjectives, possible usages as rel notions include:

s1 is a (adj) s2
or:
s1 is more ... than s2

(Currently adjectives or adverbs are not in use.)

Examples

Nouns can be interpreted as "ent1 id1 is order'st relid (notion) of ent2 id2".
Example: Document i "Larghetto" is 1st part of document j "Concerto grosso No. 3" by Handel.
Document k "Andante" is 2nd part of document j "Concerto grosso No. 3", etc.


Verbs can be interpreted as "ent1 id1 order'st does relid (notion) with ent2 id2". 
E.g. to obsolete: ent1 id1 (newer version) obsoletes ent2 id2. 

Adjectives: 1 is adj than / is an adj'er version of 2.

Examples: 
A person is director of a document (e.g. a film),
a person is actor in a film (order gives importance),
a person is film music composer for a film etc.
A company is publisher of a document (a book etc.),
a person is author or editor of a book, 
a document is a version of another document
(e.g. MP3 version of a CD-Audio track),
a document is part of another,
a document is a biography of a person,
a person is painter, photographer, etc., of a document, etc.
Useful relations among documents:
A document  is a scan (expl.: scanned version of paper document) of another,
translation, 
proofread version (expl.: of scanned document),
performance (e.g. CD of a music piece given as score),
film version (movie of a novel),
film music (soundtrack of a movie),
adaptation,
script (book of a film),
adaptation (two music scores, or two sound documents,
one being the adaptation of the other for different instruments,
e.g. vocal and instrumental, or piano vs. orchestra, or original vs. karaoke etc.), 
version in general.

person   123   1   4712  document 234 publisher (if only one person, e.g. the author himself)

Simply "version" may be used for e.g. MP3, WAV etc. versions. In general, if it's clear from the document itself what sort of version it is, e.g. from TT (tech. type) or from file extension.

The author of a book is the author of all of its parts.

A book is always created (not necessarily published) at the lifetime of its author.

No version of a document can appear before its creator(s) are born.

A translation and other derived versions of a document never are created before the document itself.

A biography is never created before the begin (but maybe during) the lifetime of the person it's about.

etc.

Constraints

ent1 and ent2 must be an entities, i.e. listed in the entity table
id1 and id2 must have records in the corresponding tables
relid must have a record in the node table
maybe there is a table that lists valid ent1 - relid - ent2 triples

Specifications for rel table

ent1: smallint, id of entity table, entity of ee1id1: integer, ID of ee1ord: order among ee1s of same ee2relid: integer, represents relation, ID of notion tableent2: smallint, id of entity table, entity of ee2id2: integer, ID of ee2

As the relid is the notion.id, the ids of notions used in rel are fixed. E.g. the id of notion 19: "version".

Notions as relation sides

Up to now, we discussed only "normal" entity instances as rel sides, e.g. persons, documents etc. But it is also useful to have notions as rel sides. For example, it is useful to know that a rock band is a music group.

Usage

This section is about how to use the program, after you have a background of the underlying data structure.

Unibas has several interfaces. Usually you will start it with

unibas

which is a symbolic link to

unibas_gui

The interfaces are:

unibas_gui, the GUI version that presents a window with some help and a menu
unibas_cli, the command-line version that is most useful for automatizng tasks
unibas_web, (TODO) the version with a browser version
unibas_kde, (PLANNED) the fully KDE-compliant GUI version

option --settings filename defines settings file, default ~/.config/Unibas/unibas04. This settings file defines host-specific settings, such as DB connection, archive directory.

The GUI (unibas_gui.py)

The GUI is started with unibas_gui.py or simply unibas (a symlink to unibas_gui.py).

It presents a main window which has a menu bar an a main widget that displays the Unibas manual. With the menu functions, the functionality of Unibas can be reached, especially the various forms that present the data (Form menu).

Data of an entity is usually displayed in a widget, technically a Qt widget derived from InstanceWidget. For example, document data is displayed using a DocumentWidget, embedded in a form like DocumentForm. A form displays one or several related entity widgets. For example, the Family Form displays widgets for father, mother, a list of children names, and details about the selected child.

Each entity widget has a context menu, and the functionality of the whole form consists of the functions in its menu bar plus the functions in the context menus of its entity widgets.

These widgets are derived from InstanceWidget. See there for details.

Whenever a change in the data of an entity widget happens, other widgets may be affected as well. The form tries to keep up to date as a whole. To do this always consistently is not so easy, and forgetting to propagate a change to other parts of a form is a frequent source of bugs. In the alpha stage of Unibas, this will surely happen sometimes. Such bugs should be reported, but a simple workaround is to quit and re-enter the form or to refresh the display, in forms that have this function.

InstanceWidget functionality

Mostly, if you see data of an entity (element), it is presented by the InstanceWidget class. This may be a simple table of attributes and values (the auto layout) or a more individual layout.

The core functionality is presented in context menus. If the field is a foreign key, the context menu has the following items:

Select: Select the entity element this refers to
Show form: Show generic entity form

Otherwise, the menu only presents the core functionality plus entity-type specific functions if there are. The core functionality is:

Save

Saves record explicitly. This is also done when navigating away from the record, but here it is done explicitly.

Rel editor

Calls the rel editor in another window. If the current entity has rel entries, they will be shown there.

Show Connection Tree Form

Shows Connection Tree Form with this entity as root. All records in different tables that are somehow connected to this entity can be reached from there.

Find

Shows a search dialog box.

Go to

Displays a dialog to enter an known ID and jumps there.

First

Jumps to first existing record, ordered by ID.

Previous

Jumps to previous existing record, ordered by ID.

Next

Jumps to next existing record, ordered by ID.

Last

Jumps to last existing record, ordered by ID.

Delete

Deletes the current record.

Add new

Asks the user to enter some descriptive text (title for documents, descr for products and objects, etc.) and from that, creates new entity.

Add new if not existing

Combines search and creation. Asks the user to enter some descriptive text (as "Add new" above). Tries to find it, and if nothing is found, immediately creates new entity from that text.

Clone

Creates a copy of current entity with new ID.

Copy last record

Copies values of last record before this into the NULL fields of this record. This is similar to clone, but:

in clone, the target is created new, here it exists already
here only NULL fields are filled; clone fills all which can fail (constraints)

Select for unification

If there is another record that really represents the same entity, Unibas can automatically merge the two. Here you can select the other entity to merge with this one. Links are automatically adapted.

Unification in Unibas

Suppose you have created two person entries, one for a certain "Willy Writer" (person 4001) and one for a "W. Writer, author" (person 4002), and you later find out that they are really the same person.

Now you could simply delete one record (for instance, 4002) and continue working with the other. There are some problems, though:

You want to keep data you already entered. For example, that record 4002 is about an 'author'. If you just delete 4002, you lose that information.
If there are different data in the two records, you need to decide for one value per attribute. For this, you had best the two records displayed side by side, but only where the values differ.
You already entered related data. For example, person 4002 is the author of document 40002, 'Book2'. You have a rel record that points to person 4002. If you just delete 4002, this reference will point nowhere.
Similar for a persaddr record and a marriage record. If you just delete 4002, these will point nowhere.

So to handle this case satisfactorily, you (or Unibas for you) must do:

Save all data from the record to be deleted (here: 4002) to the record to be kept (4001).
select a value if there are different ones.
Let references to 4002 in the future refer to 4001.
Finally, delete 4002

Unibas can automatize this for you as far as possible. Navigate to the record to be kept (here: person 4001) and select unification from the context menu. Select the record to merge into this (and delete after that; here: 4002). A dialog is shown with attributes that have different values. Accept or edit the values, and click OK.

Now Unibas calculates which references must be changes. If you like, you will be presented a list of the SQL statements the complete operation comprises. At this point you can still accept of reject the complete transaction.

Unification test. With the following data, you can test if the unification works correctly.

To test the unification, create the following records (in psql):

INSERT INTO person (id,givname,famname,remark) VALUES (4001,'Willy','Writer','a.');
INSERT INTO person (id,givname,famname,remark) VALUES (4002,'W.','Writer','author');
INSERT INTO document (id,title) VALUES (40001,'Book 1');
INSERT INTO document (id,title) VALUES (40002,'Book 2');
INSERT INTO rel (ent1,id1,relid,ent2,id2) VALUES (8,4001,1,5,40001);
INSERT INTO rel (ent1,id1,relid,ent2,id2) VALUES (8,4002,1,5,40002);
INSERT INTO address (id,house) VALUES (1001,'Villa 1');
INSERT INTO address (id,house) VALUES (1002,'Villa 2');
INSERT INTO persaddr (personid,addressid) VALUES (4001,1001);
INSERT INTO persaddr (personid,addressid) VALUES (4002,1002);
INSERT INTO marriage (id,husbandid) VALUES (402,4002);

Then open person form, navigate to record 4001, and unify it with 4002. The unifier proposes to keep id 4001, proposed givname is 'Willy' and proposed remark is 'author'; for simple strings always the longer one is proposed. Here are all SELECT statements to check the unification results.

SELECT id,givname,famname,remark FROM person WHERE id=4001 OR id=4002;
SELECT id,title FROM document WHERE id=40001 OR id=40002;
SELECT * FROM rel WHERE ent2=5 AND (id2=40001 OR id2=40002);
SELECT * FROM persaddr WHERE personid=4001 OR personid=4002;
SELECT id,husbandid FROM marriage WHERE husbandid=4001 OR husbandid=4002;

And here are the results as psql sould list them (shortened and explained).


SELECT id,givname,famname,remark FROM person WHERE id=4001 OR id=4002;
 4001 | Willy   | Writer  | author
-- No changes for document.
SELECT id,title FROM document WHERE id=40001 OR id=40002;
 40001 | Book 1
 40002 | Book 2
In rel, all references to person 4002 are replaced by references to person 4001.
SELECT * FROM rel WHERE ent2=5 AND (id2=40001 OR id2=40002);
    8 | 4001 |     |     1 |    5 | 40001
    8 | 4001 |     |     1 |    5 | 40002
The same in persaddr...
SELECT * FROM persaddr WHERE personid=4001 OR personid=4002;
     4001 |      1001
     4001 |      1002
... and marriage and other tables if there are.
SELECT id,husbandid FROM marriage WHERE husbandid=4001 OR husbandid=4002;
 402 |      4001

To clean up and remove test data

DELETE FROM person WHERE id=4001;
DELETE FROM person WHERE id=4002;
DELETE FROM document WHERE id=40001;
DELETE FROM document WHERE id=40002;
DELETE FROM rel WHERE ent2=5 AND id2=40001;
DELETE FROM rel WHERE ent2=5 AND id2=40002;
DELETE FROM address WHERE id=1001;
DELETE FROM address WHERE id=1002;
DELETE FROM persaddr WHERE addressid=1001;
DELETE FROM persaddr WHERE addressid=1002;
DELETE FROM marriage WHERE id=402;

The rel editor

Above we mentioned the Universal binary relationship table rel which relates arbitrary entities through a relation given by a notion of the notion table.

To manipulate such a relationship, i.e. a rel entry, there is the RelWidget which can be embedded in forms an is the main part of the RelDialog. Usually this is called from a entity form which displays a current entity.

The dialog has a list of rel entries where existing rels of the current entity are displayed. From them, the user can select a rel the parts of which are then editable in the lower half. That lower half contains:

two lines about the left side of the rel,
one line to manipulate the notion,
and two lines about the right side of the rel.

At the bottom there is a range of buttons.

The two lines for either side of the rel contain:

a combobox to select the entity type
a button to find the entity once its type is determined
a field that shows a description (e.g. the title in case of a document)
and a button "Show" to show a form with the details of the entity.

The line about the rel type contains:

a field showing the contents of the ord column
a combobox to select a notion from notions already in use as rels
and a button "Other" to choose from other notions.

Usually one uses notions that are already in use as rels, and that appear in the combobox. But if you want to record a rel with a notion not yet used in your DB, you select that notion with the "Other" button. Any notion can be used as rel this way, and of course you can define your own notions and use them as rels, too.

Note that some rels have implications on the program logic. For example, if a document A is a digital "version" of an analog document B, then Unibas knows that when the user wants to see B, it can show A instead.

The buttons at the bottom have the following functions:

Help: Display a short help about the dialog.
Delete: Delete the current rel.
New: Add a new rel and select it, with the current entity on both sides.
Save: Save changes. Note: Without selecting Save, your changes won't be saved on Quit.
Quit: Close the dialog.

The Unibas forms

To view data in the underlying database, you open a form in addition to the main window. Note that when you close the main window, all forms will be closed as well.

There are forms that show all details about an entity (such as the DocumentForm) and special forms that show only one aspect of one (or more) entities. Some of these special forms include:

The FileCheck form: It serves to verify the consistency of the archive.
The Album Form: Shows an album such as a CD, LP or DVD with related data.
The ConnectionTree form: Shows an entity with connected entries.

The FileCheck form

The FileCheck form displays a FileListWidget and a document widget. The FileListWidget checks media for inconsistencies and displays them. The most common problems can be solved directly if the user wishes.

The form provides in its main part a FileListWidget and a document widget, separated by a splitter. The main functionality is provided by the FileListWidget (see below). The document widget is for showing details of the selected file's document.

The FileListWidget

The FileListWidget displays the files in one or more media, that is, files that are there and files that should be there according to the DB (cp and document tables), and analyses them for inconsistencies.

Files where everything is OK are marked "case 1" and displayed with a white background, if they are displayed at all (their visibility can be toggled on and off). This is because the main purpose of the FileListWidget is to show where there are problems, and the "case 1" files only hinder with that.

Problematic files and documents are categorized in cases 2, 3 and 4 and theis subcases, and displayed with a background colored according to the case.

Files, document and their properties are displayed in a table. Each row corresponds to a file/document, and each column is a property (field in DB parlance, e.g. file name, document title, document id etc.). Most fields display useful tooltips when the mouse cursor is over them.

These are the most important columns:

case: Tells if there is a problem and in which category it falls. Tooltip gives more detail.
file: Contains the "explicit" file name resp. local path on the media whose id is in the objectid field, from the filelist or cp table (normally they should be the same anyway). In some cases there is also an "implicit" filename that can be different from this, namely when the document hash matches a different file's hash. That file is then listed in file2.
f, c and d: These fields contain an f, c and d respectively if there is a filelist, cp resp. document DB entry, otherwise they contain a '-'. If field d contains '-', for example, there is no document.
size: The filesize in bytes.
isodate: The file's modification date/time in ISO format.
objectid: The id of the media where the file is stored whose name is listed in the "file" field. From objectid and file fields, the path to the "explicit" file can be constructed.
documentid: The document corresponding to a file is normally found via the cp table. This field thus holds cp.documentid = document.id. Exception: In case 4a, there is no cp entry for an existing document. The document is then found via its hash. This id is emitted when the row changes, so another widget can display document details.
title: Title of the document whose id is in the documentid field. A tooltip shows dome details about the document.
file2 and objectid2: If there is a file that matches document's hash but is different from that in file1, this implies that this other file is the real content of the document. Its path is listed in the file2 field, its media in object2. Hence from objectid2 and file2 fields, the path to the "implicit" file can be constructed.

Inconsistency cases and how to repair them

There is a triangular structure with the corners filelist, cp and document. Every node n1 has two links with each other node (let's pick one and call it n2): one direct link and one via the third node n3). For example, to find a document corresponding to a given file, the standard way is to lookup the document id in cp. However, if a cp entry is not available, there is another way via the file's hash. (This can even be used to find out for a file downloaded from somewhere in the Internet, whether you have it already in the archive. Put the file on the active medium and run the FileCheck form on it. You'll get a case 4b if the file, i.e. its hash, is already known. For an unknown file you get a case 4a. But remember that this is not the recommended way to find this out, because you are breaking consistency in order to do it.)

So if only one link breaks, it can be restored. This redundancy comes from the use of document.hash. However, the structure is not completely symmetric. For example, two files may be identical (in content) which leads to case 2b, having a document linked via hash to two files. But there cannot be two files of the same path (name, in one medium and directory), so an equivalent from the copy side doesn't exist (a copy linked via filename to two files).

This is a list of the inconsistency cases. The descriptions can be quite technical, but usually there is also a simple hint a what could have happened that lead to this problem, and how you can provoque it, for example to test the program.

Listed is always the main case (2, 3, 4) and its subcases (a, b). All subcases of a main case have a condition in common, and add conditions of their own. One could differentiate further, but this is not done to keep things simple. Only at repair time, sometimes additional conditions are checked.

Case 2: The hashes of document and file mismatch, i.e. this file can't be the content of the document.
- Case 2a: No other file matches document.hash. Reason: file contents has been changed. Provoque by changing document.hash or modifying file contents. Repair by recalculating document.hash.
- Case 2b: There is other file that matches document.hash, shown in right column. Reason might be: file f1 has been copied (backed up) to f2 and then the original f1 has been modified. So f1 doesn't match document.hash anymore, but f2 does. In general, a human decision is needed here. The computer cannot exactly know what file to keep (or both). It offers the following possible solutions:
  - Keep original file
  - Keep modified file
  - Keep both and make the modified one a version of the original
Case 3: Copy with no matching filelist.file, i.e. there is no filelist.file that matches cp.file. Left file column remain empty.
- Case 3a: No file matches document.hash, and and the DB knows of no other copy. Reason: file has been deleted, or changed and renamed. The computer offers the following possible solutions:
  - Select an external file and copy it into the archive
  - Delete the copy and, optionally, also the document
- Case 3b: There is no file of copy's filename, but another file, shown in the right column file2, matches document hash. Maybe cp.file was changed but the name of the file on disk remained. Possible solutions:
  - Update copy DB entry, set cp.file to file2
  - Rename file2 to cp.file
Case 4: For the file in filelist, there is no corresponding cp entry.
- Case 4a: There is no document matching filelist.hash. Possible solutions:
  - Create a new document and copy.
  - Delete the file
- Case 4b: There is a document matching filelist.hash. Maybe the copy DB record has been deleted. Possible solution:
  - (Re-)create copy.

The Album Form

The Album Form represents an album (CD, DVD, book, ...) together with most of the data linked to it.

Usually an album is an object (medium) of a product. The medium hold one or more documents (songs, a movie, text, ...) in a form that is difficult or impossible to archive (WAV, DVD structure, printed text, ...). While it makes sense to catalogue these data as well (every library catalog and bibliography does that), Unibas' goal is to convert these legacy formats to easily archivable ones. So the Album Form also needs to manage documents in these digital formats (MP3, ISO-DVD or XVid, PDF, ...) and tag them as versions of the legacy-format documents. Of course, the archival media (e.g. writable CDs, DVDs and BluRays, external HDs, ...) are managed by Unibas as well and represented in the Album Form.

Furthermore, an album often comes as part of a set (e.g. 2 CDs in a box). Then it gets a little complicated. Each album is an object and usually also has a product DB record. The set as a whole also has an object and product DB record.

Usually each track in an audio CD is a document but with multi-CD audiobooks it might make sense to attribute the document to the set as a whole, because splittings are not logical but dictated by the medium. The Album Form must be able to handle such cases, the decision is the user's.

The ConnectionTree form

Connection tree represents an entity with its connections, records of other tables (entities or not, "rel" relations).

Connected records can be of three types:

Entity: The current record has a field (foreign key) pointing to other entity record, or vice versa.
Relationship: A relationship record points to the current record (vice versa not possible).
Rel: A record of the universal binary relation, one side pointing to current record.

When you open the form, only the root is displayed, the current entity record. Hit the right arrow key to open the lext level, which consists of records with direct connection to the current one. Select a record (up/down arrow keys) and hit the the right arrow key to open the lext level, and so on.

You can select a node of the tree and right-click or hit the menu key to get the context menu. Some actions refer to the selected node, some to the current record (root of the tree). These are the context menu actions:

Show Entity Form: Show the selected entity, relationship or rel in a form
Edit: Show the selected entity, relationship or rel in a simple modal dialog
Delete: Delete the selected record
Refresh: Refresh the display in case there was a change in the underlying data
Create dependent relationship: Create relationship pointing to current record (root of the tree)
Follow table path: Enter a path of table names to follow through the tree

The Document Query Form

The Document Query Form lists all documents in a user-defined SQL query.

The command-line interface (unibas_cli.py)

unibas [--settings file] [--set setting] [--command command]

Organizational structure

This section explains the Unibas intra- and inter-site organization. These are just ideas, nothing realized.

A Unibas site is a host running Unibas as a server. Several clients can have access to this server. A client can be on the same host or on a different host than the server.

In the database server's settings (for Postgres, typically in ~postgres/data/pg_hba.conf, see Installation) you can configure which clients have access to the server. In the client's settings you can configure which server the client tries to access.

Several Unibas sites might want to exchange data and documents. Every site defines independently which other sites it wants to exchange data with, and what data it accepts or sends.

A site can have a copy of another site's data (foreign data). Local and foreign data can be distinguished through PostgreSQL schemas. Each site's data are in a different schema. Instances in different schemas may happen to describe the same thing. To account for this, an identification table identifies such instances. Using this table, a site might want to complete its data about a given instance from the other site's data. For instance, if site A might have only a biography of Beethoven and site B might have only some music documents composed by Beethoven. Once the sites agree that A's person 1234 is the same as B's person 2345, namely Beethoven, each can take over the other's data and associate them with its own instance of Beethoven.

http://sql-info.de/postgresql/schemas.html Every Unibas site has a default exchange partner site, that is another Unibas site it usually exchanges data with. Usually this exchange partner is Unibas.org which is represented in the local DB as the schema unibas_org. The exchange partner can be changed temporarily and need not be a highly official site. It can just be a friend's PC. But it must be accessible via network and it must be registered with your DB and you must be registered with him. Every Unibas site that wants to exchange data with Unibas.org has to register with Unibas.org. The site name must be unique, typically it is Sitename.tld, which usually is also the schema name (if it contains dots, uppercase characters etc., it must be but in double quotes, but it is legal in PostgreSql syntax). Alternatively, the schema name may be sitename_tld. When you start a new Unibas site, you are asked which tables you want to have. Unibas offers you a choice of the implemented tables, listed in the entity table, plus of course the required tables, e.g. the entity table itself. These tables are created in your (local) schema. You won't lose any data if you delete data from your partner's schema. It is only a copy. (Your partner may decide to forget, i.e. drop, data, but they should not delete data anybody else is interested in, at least not without a warning and acknowledgement from your side that you have a copy of the data in YOUR schema.) So at any time, you can decide which additional (non-local) data you want to keep. In order to check (abgleichen) data with your partner, however, you must have a copy of part of your partner's data. This section describes the way Unibas sites exchange data and how they are organized. This section deals with the internal and external structure of Unibas sites.

The section Internal structure explains what components (database and programs) a Unibas site contains. The database can be installed on a different computer than the GUI, and furthermore there is also an HTML interface using any modern browser.

The section Inter-site structure explains how several Unibas sites can cooperate to build a distributed knowledge database, coordinated by the unibas.org site.

A Unibas site is the collection of data and programs belonging to one Unibas installation. The section internal structure describes the structure of such an installation. The section intra-site structure describe how several Unibas sites are organized and communicate. There is a main, public Unibas site Unibas.org. This collects all public information of importance. As an open-content site, it exchanges information with all other open-content sites such as Wikipedia, Freedb.org etc.; however, interfaces to these have to be defined yet.

Unibas.org especially exchanges information with other Unibas sites, i.e. sites that use the Unibas program and data structure, but are run independently. Unibas.org coordinates the public Unibas sites. The other sites can download data from Unibas.org, and if they conform to Unibas.org's rules, Unibas.org accepts data from them. The Unibas database mainly deals with entity tables. A record in such a table is called entity element (EE). Unibas.org delegates responsibility for EEs. That means, for such EEs Unibas agrees not to change the EE's data itself, but to leave that task to the Unibas site that is responsible. When that site wants to upload its data, Unibas.org accepts them.

If there is some complaint about data, Unibas forwards that complaint to the responsible site. If the site can not (or will not) conform to Unibas.org's policy, it will be taken responsibility, and Unibas.org again will be itself responsible for that EE. The first thing it does, then, is make the EE conform to its policy again.

To be more concrete: Every site gets its own site ID. Responsibility to an EE is handled per EE with a field "siteid" that has the ID of the site that is responsible for this EE (e.g. for documents, nodes, persons, maybe words, notions). For nodes, if siteid is NULL, it is assumed the same as for parent node. For quotations, the implied siteid is that of the node.

So far for existing EEs. But what if a site creates new EEs that Unibas.org has no possibility to register?

Every site that asks, gets an ID range (maybe different per entity). To create new EEs, it uses these IDs. Later, the siteid of an EE may be changed, i.e. responsibility may be transferred to Unibas.org or another sub. A site may only change EEs it is registered responsible of. If it changes other EEs, then these changes will not be accepted by Unibas.org.

For nodes, additionally another rule applies: A site may create only nodes whose IDs are in the given range, and that are descendants of a node this site is responsible of.

Operations include: find out which records to upload/download

Internal structure

The Browser interface

Inter-site structure

TODO

The central site Unibas.org

The relationship of other unibas sites to Unibas.org

Unibas.org data exchange with other open-content sites

Data transfer between unibas sites

A site may create only nodes whose IDs are in the given range, and that are descendants of a node this site is responsible of. In other words: If a site s wants to upload some nodes to Unibas.org (=u), every node is accepted or rejected according to the following algorithm:

is there u.node with s.node.id = u.node.id?
if yes, get u.node.siteid
if it is uploader ID, => accepted 

if no, set p=s.node.parent; if p=NULL => rejected

repeat with p until accepted or rejected

Data transfer between child and parent sites is realized as so-called dbpatch files. A dbpatch file is a tar .tgz archive with the following structure:

pgdump/: directory with pg_dump files

datafile: file containing attribute=value pairs

Every pg_dump file has the name of the corresponding transfer table, that is

__export_nn_tablename

where nn is the destination site id and tablename is the real table name, e.g. "node" or "document".

These dbpatch files are created by the createdbpatch bash script, which calls the createDBPatch SQL function.

createDBPatch calls prepareTransferTable which creates the temporary tables like this:
for each shared table, it selects records to be transferred from the "real" tables
it inserts these records in the temporary tables
Of course, what matters is the selection criteria. It takes into account:
last change time of record
host responsible for the record
last sync date/time

Upload and download

documents are exchanged, objects not.
cp entries and real files come with documents
real files are refered by cp entries

every exchanged entity gets a lastchange field
(quotation records: also "expires" field?)

main has a table "site" where the unibas sites are listed.
fields e.g. lastupload, lastdownload ("up" means to main)

upload and download are initiated by sub
(or subscription for download?)

to download, the sub requests a download.
then the main prepares the download as follows:
- from all exchanged entities, 
     select all fields that are not in sub's resp and
        are newer than last download to this sub
     put them in dbpatch
     for documents, add the files/subdirs
        (for this, function makeListAvailable - requires interaction)
- set lastdownload datetime to now
- main sends dbpatch to sub; the install time of dbpatch is not important,
  but the time dbpatch is created

IMPORTANT: INSERTions and DELETions must be applied in temporal order;
   otherwise there will be a delete without effect, and 
   a record that should not exist

sub remembers time of last upload e.g. in site table
to upload, like download,
- from all exchanged entities, 
     select all fields that are in sub's resp and
        are newer than last upload
     put them in dbpatch
     for documents, add the files/subdirs
- set lastupload datetime to now
- include this lastupload datetime in dbpatch
- sends dbpatch to main

upload format:
bzip2 file with:
- pgdump of entity/rel.ship table extracts (using old spinoff technology)
	see spinoff_create and spinoff_install
- datafile with lines attr=value, e.g. containing uploadtime
- "documents" directory with documents

receiver:
- drops records in received tables

Down:
select * from node where noderesp(id)!=2 and lastchange>(select lastdownload from site where id=2);
select * from quotation where lastchange>(select lastdownload from site where id=2) and nodeid in (select id from node where noderesp(id)!=2);

Up:
select * from node where noderesp(id)=2 and lastchange>(select lastdownload from site where id=2);
select * from quotation where lastchange>(select lastdownload from site where id=2) and nodeid in (select id from node where noderesp(id)=2);


??? what if record is deleted? How to propagate deletions?

to install dbpatch (which now can contain updates not only inserts),
- for each entity 
     create identical empty temp. table,
     copy records of this entity from dbpatch into temp. table,
     delete from permanent table where id in temp. table
     insert into permanent table select * from temp. table


(extension:
?? Should subs be able to restrict their fields of interest?
E.g. only certain topics (and their subtopics), or
only some languages in dictionary etc.?
If so, how to implement?
-> Maybe as extension. 
Table sub-prefs:
Fields sub, entity, id, action

if entity is node and action is exclude, then all subnodes are excluded
  (and quotations) (and notions?)
if entity is node and action is include, then 
   all parent nodes are included 
   (and quotations, but not their subnodes)
   this is to ensure that top of tree is available
if entity is language and action is exclude, then 
   all words of this language are excluded

defaults:
   default is include
   if there is include statement for an entity, then default is exclude
)





Exchanged entities are:
node
quotation
document (if not private)
person (if not private)
word
notion
not: object, room


relationship tables? ???





Questions:
- deletions
   -> include a list of all IDs of ee's in sub's resp;
      in main, delete those not in list
      problem: main->sub direction: too many
   -> other solution: remember every deletion by 
      centralizing deletions
- relationship tables, esp. between ee's of different resps
   e.g. person and document:
    partial solutions:
      - don't accept from subs, only main (one-way)


- reuse of IDs
   -> not necessary, there are enough



is there unique ID for rel.s.table?
   - if not, simply add new rel.ship
   - if yes: treat like entity
       if all entities from same sub: resp is sub
       if different resps: resp is main



for some relationship tables, only one


cross-updates without main?

sub1 knows what sub2 has?
examples: class. sheet music etc.



Main target: node + quotations, documents, persons

fields of interest:
subs tell what they are interested in


issue warning if trying to edit EE that is in someone other's resp.
However, you can steal resp if you know what you're doing.

Creating dbpatches

Transferring dbpatches

Installing dbpatches

Data structure of intermediate tables

SQL functions involved

Functions:
changeID()       changes ID of EE and in dependencies
dependencies()   lists dependencies
unify()          unifies two EEs and actualizes dependencies
split()          opp. of unify, splits an EE in two
nextID()         centralized creation of new ID
                 taking care of allocated ID range
createEE()       creates EE using nextID() and returns oid
		 using sequence; 
		 sequence nextvalue must be within limits for sub
		 and must be greater than max(id)

TODO

update __export_1_node set siteid=$siteid;
select * from __export_$thisid_node e, node n where e.id=n.id;
reject where e.siteid != n.siteid -- i.e. where n.siteid != $siteid
accept where e.siteid  = n.siteid -- i.e. where n.siteid  = $siteid

as long as there is undecided node in __export_$thisid_node do
   select * from __export_$thisid_node p, __export_$thisid_node e
      where e.parent = p.id 
   reject e where p is rejected
   accept e where p is accepted
loop max. 50 times

if there is still undecided node in __export_$thisid_node do
   issue a warning



upload from uchi:
   upload
install in kaisha:
   with WinCmd, copy from strato/trans/kaishae to prj_work/dl
   ~/bin/inst2k

upload to strato from kaisha:
   uchie
   and copy with WinCmd to Strato
install in uchi:
   inst2u

todo:

rename lastupload into lastsend etc.
define interests per site:
   - documents
   - node + quotation
   - geographical?
with this info 
   - send customized info to partner sites
   - select from input streams (TV, radio, WWW, ...)

other site-specific infos:
   - transfer way in both directions (FTP/HTTP/mail...)
   - addresses for transfer (URL/IP, ...)
   - password if necessary


need complete site list and table with fields: siteid,entityid,min_id,max_id
automatized installation incl. initial DB and PHP, and qmmslog
For this purpose:
- create a virgin partition and another partition of equal size for its copy
- install standard Suse; create copy
- run and test installation program
- modify it
- restore original
- run and test installation program again...


OK, now I can "upload" from testdb on merlin to unibas on merlin.

select id, nodeid, substring(lastchange::text,1,19) as lastchange, substring(text,1,30) as text_begin from __export_2_quotation;


todo: deletions list
      checks at installation time: do not accept everything
      dp1 format description

test scenario from james to unibas:
bash/createdbpatch 1 .james dbpatch.dp1
bash/installdbpatch .unibas dbpatch.dp1       

default should be:
1 Unibas installation per host, ~/.unibas is this one local installation


for new Unibas site do:
psql unibas; insert into site values ... # record of new site
pg_dump unibas >unidump; createdb new; cat unidump | psql new
cp -R ~/.unibas ~/.new  # for settings
edit ~/.new/settings; dbname=new
psql new; update param set val=(new id) where attr='siteid';
psql new; i sql/transfer
ID range for new records

# for upload from nelly2 to nelly (.unibas)
bash/createdbpatch  5 .nelly2 dbpatch.dp1
bash/installdbpatch .unibas dbpatch.dp1 

# for download from nelly (.unibas) to nelly3:
bash/createdbpatch 10 .unibas dbpatch_dl.dp1 
bash/installdbpatch .nelly3 dbpatch_dl.dp1 

OK, simply uploading and downloading works.
Remains to do:
- Checks (faked dbpatches etc.)
- autom. creation of new subsites
- hierarchy over several steps (unibas - nelly - nelly2)
- deletion
- ...

Unibas development and technical stuff

Unibas is written in Python/PyQt. We usually follow the recommendations about Python style, e.g. from http://effbot.org/zone/import-confusion.htm.

Unibas is designed to work with data in all writings and languages, so we use Unicode/UTF-8 for text in the database, filenames, and program text. A Unibas Python file usually begins with:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

Text in Unibas usually means Unicode text. Unibas' support for legacy codings (such as ASCII, Latin1, etc.) is very limited. Where we do support legacy codes is with infosources. Unibas reads HTML files in legacy codes, but converts their data to Unicode for internal processing.

Comments are written in RestructuredText. Unibas files usually contain the following declarations:

__author__ = "Volker Paul"
__copyright__ = "Copyright 2010, Volker Paul"
__license__ = "GPL 2"
__version__ = "0.4"
__maintainer__ = "Volker Paul"
__email__ = "volker.paul@v-paul.de"
__status__ = "Development"
__docformat__ = 'restructuredtext'

Of course these will be adapted to different authors, maintainers etc. if necessary.

The Unibas GUI consists of one main window which provides help and menus, and different forms the user can open if necessary. These forms present the data from the underlying database tables and provide functionality in pull-down and context menus.

Qt provides many other GUI goodies such as: internationalization (i18n) of user dialogs, toolbars, drag&drop, etc; Unibas does not use most of them. Only when the core functionality works reliably we can think about these luxury items.

By then, Unibas will probably be ported to PyKDE. But till then, a lot of works still remains to be done.

Specifications

old, obsolete?

Sequences: node_id_seq
From /usr/doc/cpp_tut/annot/cplusplus11.html:
   The ofstream object returns 0 if its construction failed.  
   This value can be used in tests, ...
So UBDataSet should also return 0. But what does it mean
"an object returns 0"?...
See annot 4.7 for header file organization and order
Maybe operator bool () {}? (annot 6.5)
ID returns false if id <=0, e.g. if entity of nonexisting ID wanted. Usage:
UBDocument d(12); if ( d.id ) cout << d.title << endl; else cout << "No record
";

idea: collect all Unibas files in unibas.cpp?
Anyhow, 
class entity as parent of document, object, ...
common:
toString, uniqueDescription and other ideas from Access Unibas 2.0

naming conventions
see lib/document.cpp as standard implementation
table names that begin with _tmp_ are reserved for temporary tables
only main program resides in unibas, 
classes are in unibas/lib, 
includes in unibas/include
suggest (later) to put all Unibas class files in one library

suggest using Qt convention; if needed, prefix "UB" to get
e.g. UBString if an own String class is needed
If UniBas files need to be distinguished from existing other (e.g. executable)
files, give them a "ub_" prefix. Similarly, prefix more general files
with "vp_".
Note that the class prefixes are not "ub_", "vp_" but "UB" and "VP", resp.

functions: toString, getName etc.; begin first lower, next word uppercase
classes named UBXyz, if xyz is something Unibas-related, and
VPXyz, for something more general like Regexp and TextAnalysis
abbreviations are allowed, e.g. "N" for number, "Exp" for expression, 
"Sub" for sub-parts
add UsageExample to every class, as in VPRegexp
(static function? if problematic, suggest names like VPRegexpUsageExample)

it seems to be standard to use a function toString to produce
the string representation of an object (in Java and Qt at least).
So every entity (class) will get the function toString.
The class member is what's output; the argument of toString tells
which aspects are output in what detail. 
E.g. for an author (person), his works should be sorted by importance, 
the argument could tell to output documents up to a limit of 5, 
after that "and n others", where n is the number of the remainder.

The output of toString is primarily intended for human readers
and formatted accordingly. It is not a CSV or similar representation.

The copy table is named cp because copy is a keyword.

Abbrevs (like pers for person) are deprecated.
(often used tables (mostly entities) had abbrevs,
document:doc; person:pers; marriage:marr; address:addr; company:comp;
object:obj)

ids in foreign keys are composed as name+"id":
personid, documentid, ...

The old "X" (selection) field is renamed sel, if needed.

Indexes are seldom used explicitly. Their names can be longer.
They should be named as follows:
for ID: name_pkey, e.g. word_pkey
create index notion_notion on notion(notion);
create index persaddr_addrid on persaddr (addrid);
CREATE  INDEX "cp_objectid" on "cp"  ( "objectid"  );
Postgres 7.0+ knows primary keys. 
NOTICE:  CREATE TABLE/PRIMARY KEY will create implicit index 'doc2_pkey' for table 'doc2'
So explicitly creating ID index not needed?

create unique index address_id on address (id);

concise views are named "v_" and the abbrev of the table, e.g. v_doc

ids are incremented, sequence is named tablename_id_seq, e.g. document_id_seq
("seq_"+abbrev+"id", e.g. seq_docid is old and deprecated)

For each entity table OLD_ENTITY taken over:
- Define new name ENTITY, short and in lowercase, and abbrev ENT,
  may be same as ENTITY; ENTITY_ID becomes id
- get "Select max(id) from OLD_ENTITY;", add 1 for START
- create sequence seq_ENTid start START;
- alter table OLD_ENTITY rename to temp;
- dump structure as follows, copy and paste, edit for new structure:
  ! pg_dump -s -t temp unibas
- create table ENTITY ( id int4 default nextval('seq_ENTid'),
          descr text, ...);
  do not change field order, add or delete fields!
- insert into ENTITY select * from temp;
- create unique index ENTITY_id on ENTITY (id); (not ENT_id!)
- drop table temp;


CheckEntity checks for a given entity if all rules are implemented,
e.g. there is a sequence, ids are unique, etc.
files unibas.h, unibas.cpp are for Unibas as a whole;
every entity has its own file


List of old style entity tables
company
Marriage
Node


List of converted entity tables (with sequences and indexes)
document
object
room
person
address 



List of views:
see file views

create view v_addr as select p.givname, p.famname, a.* from person p, address a, persaddr pa where pa.persid=p.id and pa.addrid=a.id;

create view v_rel as select e1.id as e1id, e1.name as e1name, id1, ord, relid, notion, e2.id as e2id, e2.name as e2name, id2 from rel r, notion n, entity e1, entity e2 where n.id=r.relid and e1.id=ent1 and e2.id=ent2;


general table "rel":
fields
      ent1 id1 order relid ent2 id2
e.g.  doc  123   1   4711  pers 234
meaning:
document 123 is biography of person 234 
(notion 4711 is biography, here interpreted as: "is ... of")
This approach is relation-independent and table-independent,
all binary relations can be coded this way.

Since 24.6.2001: rel is represented as follows:
CREATE TABLE "rel" ("ent1" int2,"id1" int4 NOT NULL,"ord" int2,"relid" int4,"ent2" int2,"id2" int4 NOT NULL);

And there is the perhaps most important relation:
CREATE TABLE "entity" (	"id" int4 DEFAULT nextval('entity_id_seq'::text) NOT NULL,
"name" text NOT NULL,"abbr" text,"descr" text,"tostring" text NOT NULL,"detail" text NOT NULL,"mainfield" text);
Abbrevs will be dropped since they are of little use.

In entity tables, where there is NOT NULL there must be a DEFAULT value.


create table word (id serial primary key, word text not null, lan text not null, gram text, pron text);
create table node (id serial primary key, parent int4 references node, title text not null, descr text);
create table notion (id serial primary key, notion text not null, explanation text not null, nodeid int4 references node);

Minimalistic plpgsql function:
drop function plural(text); create function plural (text) returns text as 'begin return $1 || ''s''; end;' language 'plpgsql'; select plural('test');
work/log

Unibas development

The UBInstanceWidget class represents an instance of an entity on screen.

Can be used to create a generic widget, but usually subclassed e.g. from UBDocumentWidget.

The UBInstanceWidget class was based on QDataBrowser which was badly documented and not completely mature. From version 4.x of Qt, QDataBrowser has been dropped, but before Qt 4.1, nothing replaced it.

From http://www.trolltech.com/products/qt/whatsnew/porting/readyforqt4/:

Data-Aware Forms

The QDataTable, QDataBrowser, and QDataView classes in Qt 3 allow integration between widgets and SQL-based databases. In Qt 4, the preferred way to create a data-aware widget is to connect an generic item view (such as a table view) to a SQL model. New data-aware widgets are planned for Qt 4.1. New applications written with Qt 3 should use QSqlQuery in preference to an approach based on data-aware widgets. This offers a choice of porting strategies when later migrating the application to Qt 4: You can either continue to use QSqlQuery or take the opportunity to use the model/view classes to handle database integration.

This is about issues, trouble, etc. in Unibas development.

ToDo: - clear DB of obsolete tables - naming schema for temporary tables to be deleted automatically - license: different fields that tell if it is allowed to - share document info (record) - share document content (file) - see in newer version of Qt if error described below still occurs - revive document creator - turn groups of lines into a (itemized, ordered, simple) list - turn groups of lines with comma/semicolon-separated items into a table - external links (ulink) etc.; blockquote yet broken

Adding widgets

Adding widgets

http://women.kde.org/articles/tutorials/kdevelop3/creating.html

 Qt Designer manual: http://doc.trolltech.com/3.2/designer-manual.html

A form from Qt Designer is saved with a .ui extension because it is then processed by the uic program to generate a .h and .cpp files. KDevelop takes care of that, the only thing we need to do is to create the form.

Each time you add or remove a file (now we will add a .ui file) in your KDevelop project, the Makefile.am will be changed. KDevelop does that itself but you will need to Run automake and friends after such changes.

Naming:
generated class: MyNameDlg, files mynamedlg.h, .cpp
subclass: MyName, files myname.h, .cpp
UI XML file: mynamedlg.ui

File -> New 
Directory: ~/data/prj/unibas/unibas
file: mynamedlg.ui
Type: Blank file (currently no other choices, literature says "Widget (.ui)")
x Add to Project


Please be sure that "Add to project" is checked, this ensures that the Makefile.am is updated.

OK; The Automake manager dialog then pops up.
Leave as is, i.e. Add new files to my active target. Click the OK button.

"If the new file newdlg.ui is not open in Qt Designer, open the Automake Manager, right-click on the file and select Open With then you should have Qt Designer. Choose it. Qt Designer will open." (In its own window.)


"A simple click in Automake Manager
on newdlg.ui opens it with Qt Designer 
in a subwindow of KDevelop."
This is true, but Qt Designer still doesn't save on F8 and 
it has very little space inside KDevelop.
So better use it in a separate window:
Open the .ui file as text, right-click on the tab and open it with Qt Designer.

Change name (MyNameDlg) and caption (user-readable text) of the form.
Create OK button and connect it to form's close(), etc.

"In this tutorial, we use KDevelop subclassing tool in Automake Manager. "

"Run automake and friends" (error due to different versions;
instead, run configure, then build).
Success.




TODO: adapt this:

In the section unibas (Program in bin) you must have three files: 
mynamedlg.ui, 
mynamedlg.cpp and 
main.cpp.

"In the Automake Manager, in unibas (program in bin), right click on mynamedlg.ui and choose Subclass Widget..."
There is no "Subclass Widget", but Subclassing Wizard, choose it;
give subclass, here MyName
wizard proposes filenames myname.cpp and myname.h, should be OK
Specialize accept() and reject()
Click Create, in next dialog: OK.
Then F8 -> success.
myname.cpp appears in unibas (program in bin)

Main Menu:
Adding a menu item needs changes at 7 places:
1. unibasui.rc, e.g.     Action name="tools_MyName"
2. unibas.h, Unibas specific slots, e.g. void slotMyName();
3. unibas.h, Unibas specific actions, e.g. KAction* toolsMyName;
4. unibas.h, class MyName;
5. unibas.cpp, in initActions
6. unibas.cpp, slot implementation
7. unibas.cpp, #include "myname.h"
 
Note: Tried to use Qt menus instead of KActions;
      Not so simple to change given structure, had segfaults
      so sticking with KActions for now.
Also, #include "ubalbummanager.h" in unibas.h


Troubleshooting signals and slots:
If the program doesn't react on a signal, 
maybe you gave variable type and name like this:
    connect( ownerType, SIGNAL( activated(int i) ), this, SLOT( makeLenderList(int i) ) );

The program will work if you use this correct syntax:
    connect( ownerType, SIGNAL( activated(int) ), this, SLOT( makeLenderList(int) ) );

subdirectories:
an attempt to use subdirectories failed. KDevelop tried to make
a library of the subdirectory but failed.
After that, manually moved the files from subdir (unibas/unibas/documedia)
to unibas/unibas.
Some changes in unibas.kdevproj
So for now, it's safe to use the procedure described above.
DO NOT:
- use subdirectories
- use QtDesigner's inline function feature

Another bug in QtDesigner:
If your Postgres database uses schemas, and uses a name in several schemas,
e.g. schema1.table_a and schema2.table_a, 
then the Data Browser Wizard will show all fields
in both table_a tables, i.e. it doesn't handle schemas correctly.

Troubleshooting Unibas

asda

ubalbummanager.cpp:22: error: invalid use of undefined type 'struct QTable'
ubalbummanager.h:16: error: forward declaration of 'struct QTable'
Although this class (UBAlbumManager) is a subclass of UBAlbumManagerDlg
and UBAlbumManagerDlg already usesQTable.
Reason: ubalbummanagerdlg.h does not #include qtable.h.
Instead, the *dlg.cpp has a
#include qtable.h
line and the *dlg.h file only has a
class QTable;
line.
Solution: ubalbummanager.cpp needs to #include qtable.h also.
ubalbummanager.h needs neither class QTable; nor #include qtable.h.

Changes in design by Qt Designer do not show up in program.
Reason: Qt Designer is an external tool, not integrated in KDevelop.
If you edit a source file in KDevelop and press F8 to build,
your changes are automatically saved before building.
If you make changes in Qt Designer and press F8 in KDevelop to build,
your changes are NOT automatically saved before building.
You must save explicitly in Qt Designer.

Error:
ubalbummanager.o: In function `UBAlbumManager::~UBAlbumManager()':
ubalbummanager.cpp:(.text+0x8): undefined reference to `vtable for UBAlbumManager'
Reason: in generated derived widget/dialog classes, at the end there is a
#include "ubalbummanager.moc"
line. If it's missing, this error appears.


Signals and slots:
Error: a slot is not called when it should be.
Reason: Maybe misspelled slot name. This is not detected at compile or link time
but when you emit the signal. Then, on stderr or stdout, this appears:
QObject::connect: No such slot UBBurner::stepx()
QObject::connect:  (sender name:   'stepButton')
QObject::connect:  (receiver name: 'UBBurnerDlg')
Start Unibas from console to see this output.
Similarly,
QObject::connect: No such slot UBBurner::startAll()
appears when startAll() is a function but no slot.


Connect error:
ubalbummanager.cpp:23: error: no matching function for call to 'UBAlbumManager::connect(QPushButton*, const char [11], QTable*, const char [14])'
/usr/lib/qt3/include/qobject.h:116: note: candidates are: static bool QObject::connect(const QObject*, const char*, const QObject*, const char*)
/usr/lib/qt3/include/qobject.h:226: note: bool QObject::connect(const QObject*, const char*, const char*) const

Answer from http://lists.trolltech.com/qt-interest/2002-07/msg00174.html
via google:
I assume you hadn't included qpushbutton.h -- the compiler 
doesn't seem to know that those classes derive from QObject or QWidget.



ubinstancewidget.moc.o:(.gnu.linkonce.r._ZTV16UBInstanceWidget[vtable for UBInstanceWidget]+0x1e0): undefined reference to `UBInstanceWidget::layout0()'
ubinstancewidget.moc.o:(.gnu.linkonce.r._ZTV16UBInstanceWidget[vtable for UBInstanceWidget]+0x1e4): undefined reference to `UBInstanceWidget::layout1()'
ubinstancewidget.moc.o:(.gnu.linkonce.r._ZTV16UBInstanceWidget[vtable for UBInstanceWidget]+0x1e8): undefined reference to `UBInstanceWidget::layout2()'
ubinstancewidget.moc.o:(.gnu.linkonce.r._ZTV16UBInstanceWidget[vtable for UBInstanceWidget]+0x1ec): undefined reference to `UBInstanceWidget::layout3()'

solution: 
    virtual void layout0(){};
instead of
    virtual void layout0();




ubdocumentwidget.moc.o: In function `UBDocumentWidget::primeInsert(QSqlRecord*)':
ubdocumentwidget.moc.cpp:(.text+0x4b0): multiple definition of `UBDocumentWidget::primeInsert(QSqlRecord*)'
ubdocumentwidget.o:ubdocumentwidget.cpp:(.text+0x2a10): first defined here

because
    virtual void primeInsert(QSqlRecord * buf);
mistakenly under 
signals:
instead of
protected slots:



This had been done as follows, yielding an error:
msgbox("without ::int");
    sql = "select oid from rel where ent1="+olde1+" and id1="+oldid1
     + " and relid="+oldrelid
     + " and ent2=" + olde2 + " and id2=" + oldid2;
logger->debug("reled7a: without ::int: sql="+sql);
    QString oid = readSQL(sql, TRUE, TRUE);
    logger-debug("yields " + oid);
    msgbox("yields " + oid);
The error was:
QPSQLResult::data: unable to open large object for read
Apparently the returned value is of type oid and
QPSQL7 expects this to be a BLOB (Binary Large OBject).
The returned QString oid is empty.
The problem can be solved by casting oid to int, see below.



ubcdripperdialog.cpp:(.text+0x29): undefined reference to `vtable for UBCDRipperDialog'

Solved by adding ubcdripperdialog.h ubcdripperdialog.cpp to EXTRA_DIST = ...
in unibas/unibas/Makefile.am

Solution found after this hint:

 I'm getting a link error, undefined reference to 'vtable and I
 understand it is a fairly common problem.

You probably forgot to moc a file. See 
http://doc.trolltech.com/3.1/commonproblems.html#linkerror

in http://lists.trolltech.com/qt-interest/2003-07/thread00243-0.html



A statement like this:
select id::text || ': ' || substr(text,1,30) from quotation where node.id=745 and node.id=quotation.nodeid;
is broken, but PostgreSQL may answer with:
NOTICE:  adding missing FROM-clause entry for table "node"
or it may output an error message.


Problem: external USB HD not automatically mounted
Solution: unplug it, restart haldaemon and plug it in again



Problem: file not included in project
Solution: add it to unibas/unibas/Makefile.am


Problem: setMinimumWidth() is ignored, width is too small.
    setCaption("Relation editor");
    setSizePolicy(QSizePolicy::Minimum, QSizePolicy::Minimum);
    setMinimumWidth(800);
Workaround: set width of child widget
    ent1-setSizePolicy(QSizePolicy::Fixed, QSizePolicy::Fixed);
    ent1-setMinimumWidth(800)


Problem: a widget (here rel) is  below relLabel, and not right of relLabel, as it should be
    QHBoxLayout* relLayout = new QHBoxLayout(topLayout);
    QLabel* relLabel = new QLabel(this);
    relLabel->setText("Relation:");
    relLayout->addWidget(relLabel);
    rel = new QComboBox(this);
    relLayout->addWidget(rel); 
// ??? Why is rel below relLabel, but ordWidget is right of ordLabel?
Reason: below was the line:
    topLayout->addWidget(rel);
which added rel to topLayout and thus effectively took it out of relLayout.


Problem: A key is pressed, and Unibas hangs, presumably in an infinite loop.
Reason: The key event handler may be broken:
void UBTrackListWidget::keyPressEvent( QKeyEvent *k )
{
    if ( k->key() != Key_Alt && k->key() != Key_Control && k->key() != Key_Shift ) {
        if ( k->key() == Key_F9 || k->key() == Key_Menu ) 
            showContextMenu(currentItem());
        else QListView::keyPressEvent( k ); // if not menu call, pass key events through
    }
}

If instead of QListView::keyPressEvent, you just write keyPressEvent,
then UBTrackListWidget::keyPressEvent is called again and again.
What is intended here, is that only a few keys are handled by 
UBTrackListWidget::keyPressEvent, and all the others are handled by 
QListView::keyPressEvent. If you don't prepend QListView::, then
implicitely the current class, i.e. UBTrackListWidget:: is prepended.



Problem:
QObject::connect: No such slot UBTrackListWidget::onCurrentChanged(QListViewItem*)
QObject::connect:  (sender name:   'trackListWidget')
QObject::connect:  (receiver name: 'trackListWidget')
Reason:
    void onCurrentChanged( QListViewItem * );
was under public, not under public slots.


Problem:
QObject::connect: No such slot QWidget::urlClick(const QString&)
QObject::connect:  (sender name:   'browser')
QObject::connect:  (receiver name: 'unnamed')
QObject::connect: No such slot QWidget::setID(ID)
QObject::connect:  (sender name:   'UBObjectWidget')
QObject::connect:  (receiver name: 'unnamed')
Reason: missing Q_OBJECT




Problem: crash for unknown reasons; backtrace and logs contradict
Reason: maybe inconsistency; rebuild all (clean? configure?)

Problem: Cannot remove file (whose name contains non-ASCII characters)
Solution:
export LANG=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
and restart unibas

30 May 2007 19:43 ubburner DEBUG percentString = 100 30 May 2007 19:51 ubburner DEBUG Step 30 May 2007 19:51 ubburner DEBUG a:unibas 30 May 2007 19:51 ubburner DEBUG b:unibas 30 May 2007 19:51 ubburner INFO update object set ot='DVD', descr='Down Under: Australien - Neuseeland', roomid=1176 where id=10845; 30 May 2007 19:51 ubburner DEBUG Trying to commit... 30 May 2007 19:51 ubburner DEBUG After commit. KCrash: Application 'unibas' crashing... unibas=# i sql/room psql:sql/room:45: WARNING: nonstandard use of escape in a string literal at character 38 HINT: Use the escape string syntax for escapes, e.g., E' '. QUERY: SELECT $1 || coalesce( $2 ,'') || ' ' CONTEXT: SQL statement in PL/PgSQL function "f_rooms" near line 16 CREATE FUNCTION Solution: replace ret = ret || coalesce(l,'') || ' '; by ret = ret || coalesce(l,'') || E' '; QPixmap: Invalid pixmap parameters QPainter::begin: Cannot paint null pixmap QPainter::end: Missing begin() or begin() failed Happened when developing UBRippedFilesArchiver with QtDesigner. Disappeared after changing the .ui file, saving it and recompiling. Exact reason unknown.

mkisofs

Various strange error messages make it seem as if mkisofs,
if called from Unibas, behaves differently as if called from shell.
It seems options which need an argument must be given with 
two addArgument statements.
If your problem is this one, then most error messages are misleading.
So the exact reason is unknown, but the solution is
to use two addArgument statements.

// as we use UTF-8, we need to tell mkisofs that
// filenames also are in UTF-8
// however, if we use this option, mkisofs mistakenly thinks
// it is the beginning of the -i option:
// mkisofs: -i option no longer supported.
//     proc->addArgument( "-input-charset UTF-8" );
// This version (2 minuses) is unknown if called from here,
// but known if called from shell
//     proc->addArgument( "--input-charset UTF-8" );
//     proc->addArgument( "--output-charset UTF-8" );
// So as a workaround, for the moment, 
// we must set environment variables, e.g. in .bashrc:
// export LC_CTYPE=en_US.utf8
// export LANG=en_US.utf8
// if we have UTF-8 encoded files, but mkisofs assumes
// some other encoding, it may fail to find the files,
// and issue the following error message:
// INFO:   utf-8 character encoding detected by locale settings.
//         Assuming utf-8 encoded filenames on source filesystem,
//         use -input-charset to override.
// mkisofs 2.01 (i686-suse-linux)
// mkisofs: No such file or directory. Invalid node - /unibas_archive/oid10444/Childrens Day in Bullerb.mp3

    proc-addArgument( "--input-charset" );
    proc-addArgument( "UTF-8" );


    proc-addArgument( "-graft-points" );
// This doesn't work, yields "/usr/bin/mkisofs: No such file or directory. Unable to open disc image file"
//     proc->addArgument( "-o " + isoFile );
// in 2 lines, it works
// It seems options which need an argument must be given with 
// two addArgument statements.
    proc-addArgument( "-o" );
    proc-addArgument( isoFile );

Fragments

Fragments about Unibas development

Releases

Unibas releases are created with setuptools and the following setup.py script:

from setuptools import setup, find_packages

setup(name='Unibas',
      version='0.4',
      description='Universal Database Application',
      author='Volker Paul',
      author_email='volker.paul@unibas.org',
      url='http://unibas.org/',
      packages=['unibas'],
      package_dir={'unibas': 'src/unibas'},
      package_data={'unibas': ['data/composers.*']},  # is ignored by sdist, bug in distutils
      data_files = [('unibas', ['src/unibas/unibas_archive.zip','src/unibas/minidb.pgdump'])],
      license="GNU GPLv2",
      scripts=['src/unibas_gui.py','src/unibas_cli.py','src/unibas_init.py']
     )

Setuptools is an enhancement of distutils. Distutils alone is bad, ignores package_data and data_files for sdist.

Setuptools includes data in egg if called as follows:

python setup.py bdist_egg

See http://peak.telecommunity.com/DevCenter/EasyInstall#installing-easy-install.

prj/unibas04$ rm MANIFEST; python setup.py sdist

Distutils bug/missing feature. MANIFEST is a list of files to distribute. Unfortunately, if the MANIFEST file exists and module files have been added or deleted, MANIFEST is not updated. So MANIFEST is first deleted in the above command. Then the command tries to create a source package with the setup() function. This creates a Unibas-0.4.tar.gz file in the dist directory. It contains the source files but not the package_data and data_files mentioned in the setup() call. There is no mention of this in the official distutils documentation, so I had to find this out after a long search of the mistake in my setup.py script. But it's not in my setup.py script, it's in distutils. As it's not mentioned in the documentation and there is no logical reason for it, I must call it a bug. See also: http://mail.python.org/pipermail/distutils-sig/2004-February/003733.html and http://mail.python.org/pipermail/new-bugs-announce/2008-March/001170.html:

distutils.sdist.add_defaults adds the Python modules and scripts and C extensions found in setup.py to the MANIFEST. It does *not* add data_files mentioned in setup.py to the MANIFEST. This is non-orthogonal and confusing, because it means that a MANIFEST.in is required if you have data_files, optional if you do not. If you have data_files and do not have MANIFEST.in, you get a broken package but no error.

And: http://bytes.com/groups/python/23497-distutils-question-why-data_files-listed-setup-py-being-ignored-how-fix Then I tried rm MANIFEST; python setup.py bdist_rpm Got the error: creating /var/tmp/Unibas-0.4-1-buildroot/usr/unibas_archive.tgz error: can't copy 'unibas_archive.tgz': doesn't exist or not a regular file error: Bad exit status from /var/tmp/rpm-tmp.73422 (%install) Maybe its reason is the same bug in distutils. (Remark: Later I found out that maybe this can be solved by writing a MANIFEST.in template, see Python-Docs-2.5/dist/manifest.html.)

Solution: setuptools. Then I turned to setuptools which seems to work better. At least it hasn't the data_files bug when creating an "egg".

rm MANIFEST; python setup.py bdist_egg

creates Unibas-0.4-py2.5.egg in dist/.

easy_install --record /tmp/rec Unibas*.egg

Processing Unibas-0.4-py2.5.egg
Removing /usr/lib/python2.5/site-packages/Unibas-0.4-py2.5.egg
Copying Unibas-0.4-py2.5.egg to /usr/lib/python2.5/site-packages
Unibas 0.4 is already the active version in easy-install.pth
Installing unibas_cli.py script to /usr/bin
Installing unibas_gui.py script to /usr/bin

Installed /usr/lib/python2.5/site-packages/Unibas-0.4-py2.5.egg
Processing dependencies for Unibas==0.4
Finished processing dependencies for Unibas==0.4
writing list of installed files to '/tmp/rec'
nelly:/home/nslu/volker/data/prj/unibas04/dist# cat /tmp/rec
/usr/lib/python2.5/site-packages/Unibas-0.4-py2.5.egg
/usr/bin/unibas_cli.py
/usr/bin/unibas_gui.py

This works, installs all data files, and the scripts, and they work. The scripts installed in /usr/bin are only wrappers. The real scripts remain in the egg which is installed in /usr/lib/python2.5/site-packages/Unibas-0.4-py2.5.egg

Remains the question: How can my program access the data? See archive.py for an example.

To summarize, in prj/unibas04: % python setup.py bdist_egg # easy_install dist/Unibas*.egg Add the option --record /tmp/rec to easy_install in order to store the messages at installation into the file /tmp/rec. Needs package python-setuptools.

Postgres as used by Unibas

This section is about the underlying RDBMS, PostgreSQL (sometimes also simply called "Postgres").

PostgreSQL is free, powerful and comes with many features hard to find on other RDBMSs, that's why Unibas is quite closely tied to PostgreSQL and it is unlikely Unibas will ever support different RDBMSs. PostgreSQL is thoroughly documented, on its home page www.postgresql.org and in the Unibas knowledge tree. PostgreSQL also comes with pgAdmin III, an excellent maintenance tool. Because of this, when it comes to maintaining Unibas' underlying PostgreSQL, we usually refer to the PostgreSQL documentation and tools. We could not do it any better.

Care must be taken when working with multiple schemata. Some clients may not properly handle the case when there are several tables with the same name in different schemata.

File-DB transactions

Many database transactions involving documents are accompanied by file operations. For example, when a new document is created from a file, a document table entry is created and a file is moved (or copied). Each of these operations is uncomplete without the other part(s), leading to inconsistency. We call the whole of these operations a file-DB transaction. Unibas tries to make sure that file-DB transactions are executed atomically, i.e. as a whole or not at all.

To achieve this, Unibas uses normal DB transactions combined with crLists. A crList is a list of pairs of file commit and rollback operations. A file commit operation is a string that can be executed by Python and then completes the file operation it belongs to. For example, if the file operation is to copy a file (while keeping the original, as implemented in copyAndKeep), the commit operation is None. However, for

copyRename

, if the destination is on a different file system than the source, the commit operation is 'os.remove("%s")' % src, as the file is first copied and then, at commit time, the original is removed.

Similarly, a file rollback operation is a string that, if executed by Python, rolls back the file operation it belongs to. For example, if the file operation is to copy a file (copyAndKeep), the rollback operation is to remove the copy (destination) again:

'os.remove("%s")' % dest

To commit all operation in a crList,

utilities.commitFileOperations(crList)

is called which executes all "commit" parts of the pairs.

utilities.revertFileOperations(crList)

executes all "r" parts of the pairs.

To begin a file-DB transaction, an empty crList is created and a DB transaction is begun:

    crList = []
    db.transaction()

To commit a complete file-DB transaction:

    db.commit()
    commitFileOperations(crList)

To rollback:

    db.rollback()
    revertFileOperations(crList)

The operations in the transaction are executed in a try...except block. If an exception occurs and is not caught inside, it causes a rollback. Before the commit, the user can be asked if he really wants to commit. If he refuses, too, a rollback is done. If he agrees, then finally the commit is done.

So a typical file-DB transaction looks like this:

    db = openDB()
    crList = []
    if not db.transaction():
        errmsg("transaction() failed!")
        log.error("transaction() failed!")
        return
    try:
        # DB and file operations
        crList.append(doc.copyRename()) # creates dest if necessary
        doc.saveExtended() # calls doc.save() and there an id is got
    except Exception, s: # rollback due to SqlError or file error
        log.error(unicode("SqlError or file error: ") + unicode(s))
        log.info("Start of db.rollback() and revertFileOperations(crList)")
        db.rollback()
        revertFileOperations(crList)
        return
    else:
            # Wish: At this point it would be great if the user could see
            # all DB and file operations that will be done on commit.
            # Optionally, he might also want to see what will be done on rollback.
            # File operations are in crList, that's no problem.
            # Remains the DB operations. Perhaps a special execSql() version
            # that remembers the SQL statements?
        if QMessageBox.question(None, "DVD Wizard",
                "Do you want to commit?",
                QMessageBox.Yes | QMessageBox.No,
                QMessageBox.Yes) == QMessageBox.Yes:
            db.commit()
            commitFileOperations(crList)
        else:
            log.info("Rollback on user request")
            db.rollback()
            revertFileOperations(crList)
            return

So this block of code in DB and file operations:

    crList.append(doc.copyRename())
    a=1/0 # provoque error
    sqlList = doc.saveExtended()

leads to the following log output:

INFO:utilities.copyRename:renamed /unibas_archive/toarchive/myfile.pdf to /unibas_archive/oid12156/myfile.pdf ERROR:newDocumentFromFile:SqlError or file error: integer division or modulo by zero INFO:newDocumentFromFile:Start of db.rollback() and revertFileOperations(crList) INFO:revertFileOperations:os.rename("/unibas_archive/oid12156/myfile.pdf","/unibas_archive/toarchive/myfile.pdf")

Note the output of copyRename(), then the error message from a=1/0, then the start of the rollback code instead of saveExtended() which would normally follow. In the rollback code, revertFileOperations() executes os.rename() which renames a file back.

Creating new tables

When creating new entity or relationship tables, in order for unibas to properly work with them, one should take care of some things (suppose the new table is called newtable): run database.adapt_column_metadata('newtable') if necessary, adapt column_metadata for fields that need several lines like this:

UPDATE column_metadata SET lines=5 WHERE entity='event' AND col='description';

run checkForeignKeys() and manually add tablelinks entries if necessary, like this:

INSERT INTO tablelinks (mastertable, subtable, subfield) VALUES ('event','event','partof');

For entity tables, additionally: make an entry in entity table make an entry in the entity list in unibas_gui.py, line 133 its id should be created as follows:

id int primary key default nextid('newtable')

InstanceWidget

It is important that the form changes are saved when the user quits the form. This is done explicitly in Unibas because Qt provided no reliable solution for that. Here are my experiments:


       self.mapper.addMapping(fieldEdit, self.ee.model.fieldIndex(n))
        
       QObject.connect(fieldEdit, SIGNAL("customContextMenuRequested(const QPoint&)"), 
           self.showContextMenuxy)

OnManualSubmit: selection remains, data is stored to table, but not to DB. Here unusable, because we have no table (widget), only form.

OnRowChange: same as OnManualSubmit; only when user double-clicks on new record in table, DB is updated; unusable here

OnFieldChange: Field is updated, but highlight is lost. Next field is not updated. When there is table widget, user needs to re-select row. But we have only form, can't re-select (at least not this way).

Current solution: b2 calls:

self.ee.model.submitAll() self.mapper.setCurrentIndex(0)

on button click or on QLineEdit's signal editingFinished() emitted on tab (on Return currently, QLineEdit's content is restored from DB, which is especially annoying). I tried a lot of tricks, including programmatically sending FocusOut:

focusWidget = QApplication.focusWidget() QApplication.sendEvent(focusWidget, QFocusEvent(QEvent.FocusOut))

to mimic Tab behavior, but nothing worked. The best way to save currently is pressing Tab (NOT Enter!) or clicking on Save button (what matters here is not the button's function but the fact that by clicking it, the old widget loses focus). To sum up, this widget can be used, but with caution. Always check if its contents is really saved.

TODO: A more robust solution must be found, and several approaches (layouts) should be tested in parallel.

GenericEntity functionality

Some problems occur again and again, in almost all entity types, and part of these are solved in the GenericEntity class.

For example, often a string is given, e.g. the formatted name "Dr. Albert Einstein", and needed is this string split up into fields and their values, in the example givname='Albert', famname='Einstein' and title='Dr.'. The GenericEntity class solves this problem in its function getNameFromString().

Another problem: Given a new entity (instance), decide if and which existing entity matches it. This problem is addressed by getPossibleMatches().

These functions need some data to work. The former needs the list getNamePatterns, the latter needs the matchTryList. Both are currently stored in the entity table and must be carefully set up.

Important

This and other manipulations of the entity table must be done with extreme care. This table is essential for the functioning of Unibas.

In the following, we will discuss the mentioned functions in detail. To understand them, a good knowledge of Python and especially regular expressions is needed.

The getNameFromString() function

This function tries to read data, e.g. a person's given name (givname) and family name (famname) from a given string.

Often an entity is given, from an infosource or wherever, as a string, e.g. a formatted name such as "Dr. Albert Einstein". The task of this function is to extract data, in the example givname='Albert', famname='Einstein' and title='Dr.'. To do this, the function tries several patterns. If the input string matches such a pattern, its parts can be associated with the fields of the entity. In the example, the pattern

(?P<title>Dr.) *(?P<givname>[A-Z]w*) (?P<famname>w+)

would match.

Of course this function is not limited to persons only. this is just an example. However, for every entity type a different list of patterns is needed. Currently, these lists are stored in the get_name_patterns field of the entity table. You have been warned that you must be extremely careful writing these lists.

Technically, these patterns are regular expressions with symbolic group names defined by

(?P<id>)

See http://docs.python.org/library/re.html.

If a pattern matches, the functions extracts the group values corresponding to the symbolic group names from the pattern. The first matching pattern "wins", others are not taken into account. Therefore the ordering of the list matters.

The getPossibleMatches() function

This function tries to return a list of existing records from the database table that might match (be identical) with this instance, ordered roughly by likelihood. It returns a list of tuples (id, likelihood, descr) where likelihood is a value between 0 and 100, and descr is tostring from entity table (human-readable description as plain text). The list is sorted by likelihood in descending order, so the first item is most likely one.

This function needs data stored in the match_try_list field of the entity table. Again, be careful editing it.

The format is:

[(10,"famname='%(famname)s'"),(90,"givname='%(givname)s' AND famname='%(famname)s'")]

It is a list of matching attempts between in-memory instance and DB instances. Each attempt consists of an approximate likelihood and an uninstantiated WHERE statement. This WHERE statement will be instantiated with values from self.vd, if possible, and with that instantiated WHERE statement a SELECT will be made. The results of all those SELECTs are returned after sorting, so here the order does not matter, as all elements are taken into account and the result is sorted anyway. Duplicates are removed.