======================================
Multi-language support for url aliases
======================================

:Author:    Jan Borsodi
:Version:   1.0beta2
:Date:      09.July.2007

Glossary
========

SEO:
  Search Engine Optimization

Transliterate:
  Transform text from one alphabet into another (usually Latin) by
  replacing characters, e.g. from Cyrillic to Latin.


Upgrade
=======

Step 1
------
Changes to the database tables and creation of new ones are handled
by the generic database upgrade procedure (update/database/\*).

Step 2
------
Before the indexing can start the type of transformation for the URLs
must be decided. There are three types available.

1. Only allow a restricted set of characters in the URL: a to z,
   numbers and underscore. (This is the same behaviour as in 3.9 and
   earlier.)

   The identifier for this is *urlalias_compat*

2. Allow more characters in the URL, but still restrict it to the
   ASCII characters (with a few exceptions). Capitalization of words
   is now kept.

   The identifier for this is *urlalias*

3. Similar to #2 but allow all Unicode characters (with a few
   exceptions). This allows the text to be preserved as much as
   possible and is highly recommended for uni- or multi-lingual sites.
   The only changes to the text are the removal of a few characters
   which are special in URLs on the Internet and the trimming of
   multiple whitespace characters to a single one.
   It is recommended to use the utf-8 charset for the site when having
   this enabled (*i18n.ini*).

   The identifier for this is *urlalias_iri*

When the desired transformation is chosen it must be configured in
*site.ini* by setting *TransformationGroup* in the settings group
*URLTranslator* to the identifier of the chosen type,
e.g. if the third type was chosen::

  [URLTranslator]
  TransformationGroup=urlalias_iri

Advanced users might also want to take a look at *transform.ini* to
configure their own transformation group. Tweaking this file and
adding an extension to the transformation allows for full control over
the created URL aliases.

Note: #3 is referred to as IRI_ (Internationalized Resource
      Identifiers), which extends URI/URL with Unicode
      support.

.. _IRI: http://www.w3.org/International/O-URL-and-ident.html


Step 3
------

The transformation type has been chosen but the database still
contains the old aliases. To update the aliases and transfer the old
URLs as history, the script *bin/php/updateniceurls.php* must be used.
It first transfers the old aliases to the new system, then goes over
all objects in the system and creates new aliases for them. The old
aliases then become history entries.

Run the script with::

  bin/php/updateniceurls.php

Note: The script will use all the languages of your site, and not just
      the ones defined in the chosen siteaccess. This is needed to
      create all aliases of all languages.

If you decide to change the transformation type later on you can
re-run this script. All existing aliases (using the old type) will
then be stored as history entries.


Note: Running this script may take some time depending on the number
      of nodes.

Step 4
------

Clear the content cache, the template-block cache and the tree menu
cache to allow the new aliases to be used on the site.

e.g.::

  bin/php/ezcache.php --clear-id=content,template-block,content_tree_menu


Documentation
=============

How it works
------------

The new alias system follows the same behaviour as the multi-lingual
support added to eZ Publish 3.8.

Aliases are now created per translation on the object and not just for
the main translation. 

For instance if the following structure is present::

  Company (node 10)
  |-- About-us (node 11)
  `-- Contact (node 12)

If *node 10* is then translated into German and gets the alias
*Unternehmen*, the structure becomes::

  Company|Unternehmen (node 10)
  |-- About-us (node 11)
  `-- Contact (node 12)

The sub-nodes can now be accessed using either the English or the
German alias, i.e.::

  Company/About-us
  Unternehmen/About-us

Next, if *node 12* is translated into French and gets the alias
*Contactez-nous*, the structure becomes::

  Company|Unternehmen (node 10)
  |-- About-us (node 11)
  `-- Contact|Contactez-nous (node 12)

*Node 12* can now be accessed using any combination of the English,
French and German aliases, i.e.::

  Company/Contact
  Company/Contactez-nous
  Unternehmen/Contact
  Unternehmen/Contactez-nous
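
The four combinations above are simply every choice of one alias per
path element. A minimal, self-contained sketch of that enumeration
follows. The function *combineAliases* is hypothetical and not part of
eZ Publish, which resolves aliases through the database rather than by
building such a list::

  <?php
  // Hypothetical helper (not an eZ Publish API): builds every path
  // that can be formed by picking one alias per path element.
  function combineAliases( $levels )
  {
      $paths = array( '' );
      foreach ( $levels as $aliases )
      {
          $next = array();
          foreach ( $paths as $prefix )
          {
              foreach ( $aliases as $alias )
              {
                  // The first element has no leading slash.
                  $next[] = ( $prefix === '' ) ? $alias : $prefix . '/' . $alias;
              }
          }
          $paths = $next;
      }
      return $paths;
  }

  $paths = combineAliases( array(
      array( 'Company', 'Unternehmen' ),      // aliases of node 10
      array( 'Contact', 'Contactez-nous' ),   // aliases of node 12
  ) );

  foreach ( $paths as $path )
  {
      echo $path, "\n";
  }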

Note however that the chosen alias does not decide the language which
is shown on the site; that is handled by the language preference list
of the siteaccess. The alias is only present to find the correct node
using a more understandable name.

If a siteaccess with only French (and English as fallback) was used,
then only these aliases would be possible::

  Company/Contactez-nous
  Company/Contact
  Company/About-us

while the preferred alias for *node 12* would be *Company/Contactez-nous*.
If *node 10* was configured to be available no matter what the
language is, then the German alias could also be used on the French
siteaccess.

URL alias pattern
-----------------

A new field has been added to the class edit interface which allows
for setting an attribute to act as the input field for the URL alias.
This works in the same way as the *Object name pattern* but is used
for the alias. This allows editors to use one name for the object and
another (usually abbreviated) for the alias.

Setting this up requires creating a new attribute in the class,
naming it accordingly and then using the identifier of the attribute
in the pattern input, e.g. if the attribute is named *Alias* and has
the identifier *alias* the resulting pattern would be::

  <alias>

If the pattern is set to be empty the system will use the name of the
object instead.

Managing aliases
----------------

The GUI for managing URL aliases has been revamped and simplified and
is split into two parts.

Node aliases
~~~~~~~~~~~~

The first part is the new GUI for managing aliases for content nodes,
found at *content/urlalias/<nodeid>*, which provides a simpler
overview of the aliases of any given node. The interface makes it
possible to create new aliases and remove existing ones, and provides
handy links to the current aliases.

The GUI for new aliases consists of a language drop-down, an input
field for the alias and a checkbox.

The language drop-down decides the language of the created alias. Note
also that the `always available` flag is taken from the node.

The input contains the alias of the URL. It can also contain a full
URL to place the alias at a different location. The location may
already exist; if it does not, the intermediate entries are created
but cannot be accessed, only the final alias can. If new
nodes/aliases are later created with the same name as these fake
locations, the locations will be reclaimed and can then be accessed.

The checkbox called `Relative to parent` decides where the alias/URL
starts. If checked it will start from the parent of the current node,
which makes the alias a sibling or child of the current node.
If unchecked the alias/URL is created from the root of the site.

Note: Reaching the url alias page for the node can be done from the
      drop-down menu when viewing the node or the drop-down from the
      tree menu.

The generated aliases of the node are also shown in a separate list.
Modifying these can only be done by editing the node; links to the
edit interface are present in this very same list.

Global aliases
~~~~~~~~~~~~~~

The other GUI is the existing global overview of aliases in the
system, found at *content/urltranslator*, which has been cleaned up to
make it easier to use.

The interface is similar to the one on the node with a few
differences. The main list displays all aliases in the system
sorted by the name of the element (not path). However, it does not
display aliases for nodes since that is taken care of by the other
GUI.

Creating a new alias requires the input alias/url and the
destination. The destination can be another alias or a module+view in
the system. The input alias/url will always start at the root of the
site.
The language drop-down decides the language of the url and the
`Include in other languages` checkbox dictates whether the alias is
also available in other languages.

Note: If an alias is created for a node (either content/view/full/<node>
      or its alias) the alias will be created but will not appear in
      the global list. The user is informed about this and given a
      link to go to the alias page for the node.

Transforming input
~~~~~~~~~~~~~~~~~~

When the user enters an alias or URL the system will clean up the
input by using the same transformation rules as for the generated
aliases. This is needed to avoid certain characters and ensure that
the alias conforms to the URLs of the site. If the alias is modified
the user will be notified about it.


Alias transformation
--------------------

The transformation of the entered/generated aliases has changed a bit
in 3.10. The changes were made based on user input and the new
possibilities of modern browsers.

Dash vs underscore
~~~~~~~~~~~~~~~~~~

Previously eZ Publish used underscores as the separators of
words. However, more and more people have asked to use dashes
instead of underscores, often for SEO reasons.
Instead of enforcing this behaviour eZ Publish allows the user to
decide which character to use as separator.

Change the INI setting *WordSeparator* in group *URLTranslator* (file
*site.ini*) to either *dash*, *underscore* or *space*.
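
For instance, switching to dashes would look like the following
fragment (the location of the override file depends on your
installation)::

  [URLTranslator]
  WordSeparator=dash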

Remember to run *updateniceurls.php* when this setting has changed.

Note: The space is currently an experimental feature and might be
      removed before the final 3.10 release.


Unicode support
~~~~~~~~~~~~~~~

The previous transformation rules were quite restrictive and only
allowed a subset of the ASCII character set (i.e. a-z, 0-9 and _). This
caused lots of problems for non-western languages which use different
alphabets, some of which are quite hard to transliterate.
In eZ Publish 3.10 it is now possible to enable Unicode support for
the transformation. The result is that no transliteration is performed
and most characters are allowed. The only ones which are not allowed
are:

  space, ampersand, semi-colon, forward slash, colon, equal sign,
  question mark, square brackets, parentheses and plus

They have to be transformed to avoid problems with their existing use
in the web pages and the HTTP protocol.
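
As a rough illustration of handling the characters listed above, the
following self-contained sketch replaces each run of them with a
separator. The function *replaceReservedCharacters* is hypothetical,
and replacing with a dash is an assumption; the real rules live in
*transform.ini* and may treat individual characters differently::

  <?php
  // Hypothetical helper (not an eZ Publish API). Replaces runs of
  // the characters: space & ; / : = ? [ ] ( ) +
  function replaceReservedCharacters( $text, $separator = '-' )
  {
      // All listed characters are ASCII, so UTF-8 text is left intact.
      $text = preg_replace( '/[ &;\/:=?\[\]()+]+/', $separator, $text );
      // Drop separators left over at either end of the text.
      return trim( $text, $separator );
  }

  echo replaceReservedCharacters( 'Kontakt & Anfahrt (neu)' ), "\n"; // Kontakt-Anfahrt-neu
  echo replaceReservedCharacters( 'What is new?' ), "\n";            // What-is-new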

The Unicode characters are encoded using the IRI_ standard which
basically encodes the text into UTF-8 and then performs further
encoding as mentioned in `RFC 1738`_. The resulting URL contains only
characters which are valid for the HTTP protocol and will work in all
existing browsers/clients. Modern browsers will also decode the URL
and display it as Unicode characters to the user.
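
The two encoding steps can be sketched with plain PHP as follows. This
is an illustration only and not the code eZ Publish uses; the function
*encodeIriPath* is hypothetical, and the input is assumed to already
be UTF-8::

  <?php
  // Hypothetical helper (not an eZ Publish API): percent-encodes each
  // path element of a UTF-8 string while keeping the slashes.
  function encodeIriPath( $path )
  {
      $elements = explode( '/', $path );
      foreach ( $elements as $i => $element )
      {
          $elements[$i] = rawurlencode( $element );
      }
      return implode( '/', $elements );
  }

  // "\xC3\x9C" is the UTF-8 byte sequence for the letter U with umlaut.
  $encoded = encodeIriPath( "\xC3\x9Cber-uns/Kontakt" );
  echo $encoded, "\n";                 // %C3%9Cber-uns/Kontakt
  echo rawurldecode( $encoded ), "\n"; // the original UTF-8 text

The encoded form is what travels over HTTP, while the decoded form is
what a modern browser shows in its address bar.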

To use the Unicode format *site.ini* must be configured to use the
*urlalias_iri* transformation, e.g.::

  [URLTranslator]
  TransformationGroup=urlalias_iri

.. _`RFC 1738`: http://www.faqs.org/rfcs/rfc1738

Case aware
~~~~~~~~~~

The case of a character is no longer transformed into lowercase,
however all matching is done case-insensitively. This means that the
original text is preserved as much as possible while it is still
possible to enter the text in any case you would like. It also means
that two nodes on the same level which only differ in case will not
get the same alias; one of them will be adjusted to be unique.

Case preservation is handled by the *urlalias* and *urlalias_iri*
transformation groups; *urlalias_compat* will perform the lowercase
conversion as it did earlier.
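
The combination of preserved case and case-insensitive matching can be
sketched as follows. The function *makeUniqueAlias* is hypothetical,
and appending a number is only an assumption about how a clash might
be resolved; the sketch also compares ASCII letters only::

  <?php
  // Hypothetical helper (not an eZ Publish API): keeps the case of
  // the alias but compares against siblings case-insensitively.
  function makeUniqueAlias( $alias, $siblingAliases )
  {
      $taken = array();
      foreach ( $siblingAliases as $sibling )
      {
          $taken[strtolower( $sibling )] = true;
      }

      $candidate = $alias;
      $counter = 2;
      // Assumed strategy: append a counter until the name is free.
      while ( isset( $taken[strtolower( $candidate )] ) )
      {
          $candidate = $alias . $counter;
          $counter++;
      }
      return $candidate;
  }

  echo makeUniqueAlias( 'about-us', array( 'About-us', 'Contact' ) ), "\n"; // about-us2
  echo makeUniqueAlias( 'News', array( 'About-us', 'Contact' ) ), "\n";     // News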

Compatibility
~~~~~~~~~~~~~

As mentioned earlier in the document it is possible to get the old
behaviour of alias transformation back. Simply configure
*TransformationGroup* in *site.ini* to contain *urlalias_compat* as
the type, e.g.::

  [URLTranslator]
  TransformationGroup=urlalias_compat

Filtering of alias text
~~~~~~~~~~~~~~~~~~~~~~~

To provide even greater flexibility of the generated aliases the
ability to filter them was added. The system will invoke one or more
filters as defined in the system on the URLs before the final filtered
result is transformed to a valid alias.
The filters can be created in extensions and added to the system with
INI configuration which can be found in the *site.ini* file under
the group *URLTranslator* and variable *Extensions*.
The active filters are set in the *Filters* variable and will be
searched for in all the extensions. The filename which is searched for
is the lowercase name of the filter with a *.php* suffix, and the
filter name is also the class name which is searched for, e.g.::

  Filters[]=StripWords

will look for file *stripwords.php* and class *StripWords*.
The filter class must implement a method called *process* which takes
three parameters: the text to filter, the language object
(eZContentLanguage) and the object which called the filter
process. The method must return the newly filtered text.

An example::

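  class StripWords
  {
      // Called by the alias system; must return the filtered text.
      function process( $text, $languageObject, $caller )
      {
          // Remove the whole word "and" only; a plain str_replace()
          // would also strip it from inside words such as "sandwich".
          return preg_replace( '/\band\b/', '', $text );
      }
  }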
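
Registering such a filter might then look like the following fragment.
The extension name *myextension* is a placeholder, and the value
format of *Extensions* shown here is an assumption; check the comments
in *site.ini* for the exact syntax::

  [URLTranslator]
  Extensions[]=myextension
  Filters[]=StripWords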

Custom transformation commands
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The transformation also received an upgrade with the possibility to
create your own commands for doing transformation. 
The commands can be created in extensions and added to the system with
INI configuration which can be found in the *transform.ini* file under
the group *Extensions* and variable *Commands*.

To create a command follow these steps:

Step 1.
^^^^^^^

Create a new class (e.g. *MyReverse*) and place it in your extension,
e.g. *extension/myextension/transformation/myreverse.php*.
This class must have a static method called *executeCommand* which
takes three parameters, *$text*, *$command* and *$charsetName*.

- $text - The input text to transform.
- $command - The name of the command to execute. This can be used to
  keep multiple commands in one function.
- $charsetName - The name of the charset in use for $text, usually not
  needed.

The full code would look something like::

  class MyReverse
  {
      function executeCommand( $text, $command, $charsetName )
      {
          $text = strrev( $text );
          return $text;
      }
  }
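
Since *$command* carries the name of the command being executed, one
class can serve several commands. The following sketch shows the idea.
The class *MyTextCommands* and the command name *my_upper* are
hypothetical, and each command name is assumed to be registered
separately against the same file and class. The *static* keyword
requires PHP 5 or later::

  <?php
  // Hypothetical command class handling two commands in one method.
  class MyTextCommands
  {
      static function executeCommand( $text, $command, $charsetName )
      {
          switch ( $command )
          {
              case 'my_reverse':
                  // strrev() works on bytes, so this is only safe
                  // for single-byte text.
                  return strrev( $text );

              case 'my_upper':
                  return strtoupper( $text );
          }
          // Unknown command: return the text unchanged.
          return $text;
      }
  }

  echo MyTextCommands::executeCommand( 'Hello', 'my_reverse', 'utf-8' ), "\n"; // olleH
  echo MyTextCommands::executeCommand( 'Hello', 'my_upper', 'utf-8' ), "\n";   // HELLO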

Step 2.
^^^^^^^

Register the PHP code in the *transform.ini* INI file. It needs an
entry in the *Extensions* group under the variable *Commands*, which
expects the path to the PHP file and the class name (separated by a
colon).

The INI entry would look like::

  [Extensions]
  Commands[]
  Commands[my_reverse]=extension/myextension/transformation/myreverse.php:MyReverse

Step 3.
^^^^^^^

Add your newly created command to one of the transformation groups
by doing something like::

  Commands[]=my_reverse

Now the command is registered and should be working.

Step 4.
^^^^^^^

To test that the command is working try this code snippet::

  include_once( 'lib/ezi18n/classes/ezchartransform.php' );
  
  $textList = array( 'Hello there', 'What_the?' );
  
  $transform = eZCharTransform::instance();
  
  foreach ( $textList as $text )
  {
      $trText = $transform->transformByGroup( $text, 'urlalias' );
      echo "Original text '$text'\n";
      echo "New text      '$trText'\n";
  }

Store it in *trans.php* and run it with *bin/php/ezexec.php*, e.g.::

  ./bin/php/ezexec.php trans.php
 
Developer changes
=================

The first change is in the database schema. A new table called
*ezurlalias_ml* has been added which contains all the new aliases. The
old aliases are still kept in the table *ezurlalias*, but a new column
called *is_imported* has been added, which keeps track of which of the
old aliases have been successfully imported.
The design of the *ezurlalias_ml* table is explained in more detail in
a different document.

Together with the new table there is also a new class (actually more
than one) which handles the new aliases. The class is eZURLAliasML and
is located at *kernel/classes/ezurlaliasml.php*. The old class
eZURLAlias is still present but the functions have been disabled.
The eZURLAliasML class together with eZPathElement, eZURLAliasQuery
and eZURLAliasFilter handles all the new functionality and is designed
to be used by extensions and other parts of the system (i.e. no longer
just eZContentObjectTreeNode). For an overview of how to do this,
see the API documentation of eZURLAliasML.

The old wildcard cache is no longer needed and can safely be removed
from the filesystem if it is present. The new database design has
made it obsolete.

eZURLAlias
----------

The class is deprecated and using it will stop the running
process. All code using this class must be converted to use
eZURLAliasML instead, which has the same signatures for the
functions *translate*, *cleanURL*, *convertPathToAlias* and
*convertToAlias*. The rest of the functionality has changed as needed
by the new database design.

eZContentObjectTreeNode
-----------------------

The attribute *path_identification_string* has been kept but will only
be created for the main language of the node. For real multi-lingual
path entries use the *urlAlias* function (*url* also) and the
attribute *url_alias* (*url* also) for templates.

The method *updateURLAlias* has been deprecated in favor of
*updateSubTreePath*, which has been modified to handle multiple
languages.

eZURI
-----

Decoding of IRI_ input has been added, so that the resulting URI is
in the charset of the site.
In addition, encoding of outgoing URLs has been added to
*kernel/error/view.php* and *lib/ezutils/classes/ezhttptool.php*.