eXtensible Text Framework (XTF) Code

Framework for search and display of heterogenous document collections.

Brought to you by: khasting, lrschiff, mhaye

Changes in 1.9
==============

- Lots of documentation updates. See docs/index.html for details.
  [Support #1548143, 1542336, 1524668, 1493134, 1487452,
  1470971, 1430268, 1365117, 1298253, 1292469, 1289172,
  1289092, 1289091, 1285432, 1285208, 1285201, 1285189,
  1269967]

- Stylesheets can now cause a real HTTP redirect, to send the
  user's browser to a different URL. [Feature 1568108]

- New query operator: multi-field AND, that requires *all* terms
  to be present, but in *any* of the listed fields. Default
  stylesheets now use this for basic "keyword" search.
  [Feature 1607168]

- New query operator: orNear. This is like a typical OR query,
  except that when multiple terms are present in the same meta-
  data field, their proximity is taken into account when
  scoring. [Feature 1607170]

- Added Saxon extension functions to check a file's existence,
  and get its length or timestamp. [Features 1118097, 1306981]

- Fixed bug: Scoring of hits in the full text ("text" field)
  was not normalizing based on the length of the document,
  causing full-text documents to dominate meta-data-only docs in
  keyword searches that include both. [Bug 1607181]

- Fixed bug: The original EAD stylesheets didn't show hits
  in context. Also, the output was very different from the TEI
  display. EAD formatting is now much better. [Bug 1548258]

- Fixed bug: Some links in the TEI navigation bar were broken.
  [Bug 1607175]

- Fixed bug: wildcards containing accented characters were not
  working properly. [Bug 1571681]

- Fixed bug: Diacritic characters in URLs weren't always being 
  properly decoded by the servlets. [Bug 1570248]

- Fixed bug: The XTF query parser was rejecting <sectionType> 
  queries within <near> queries. [Bug 1570262]

- Fixed bug: Pass-through tags had stopped working, with the
  servlets rejecting them. [Bug 1554410]

- Fixed bug: "Exact" queries beginning or ending with stop
  words failed to match. [Bug 1548289]

- Fixed bug: the year parsing in the default textIndexer
  prefilters was failing to parse dates with repeated years
  such as "December 1999, copyright (C) 1999". [Bug 1548260]

- Fixed bug: Synthetic queries (such as "more like this") were
  using the Lucene "coordination factor" and thus getting
  artificially low scores. The coord factor is only for user-
  generated queries. [Bug 1607177]

- Fixed bug: Default crossQuery result formatter was jamming all
  text snippets together on one line. [Bug 1607173]

- Fixed bug: XTF query parser wasn't recognizing the "maxSnippets"
  attribute on the main <query> element, only on sub-elements.
  [Bug 1591585]

- Fixed bug: phrase queries containing repeated terms failed to
  match at all. [Bug 1523473]

- Added experimental dynamic FRBR mode. See
  docs/experimental.html for details.

- To aid in searching fine-grained numeric data (such as
  timestamps), a new mode for the <range> operator has been
  added that allows for very efficient searching on numeric
  data. [Feature 1541643]

- A couple of jar files have been removed, as they were no longer
  necessary. [Bug 1585405]

- The web.xml file included should now be more compatible with
  Jetty. [Bug 1585408]

- A batch file for running the textIndexer under Windows is now
  included in the distribution. [Feature 1595965]

- Added support for TEI tables in 
  style/dynaXML/docFormatter/tei/table.xsl [Bug 1607153]

- Acknowledgement messages for the generous support from the
  Andrew W. Mellon Foundation have been placed in the various
  source code modules that were primarily impacted by the
  grant-funded work. [Feature 1607184]
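
Illustration of the multi-field AND operator listed above: every term
must be present, but each term may be found in any of the listed fields.
This is a minimal Python sketch of that matching rule only, not XTF's
implementation. The function name, the dictionary-based document, and
the whitespace tokenizing are all assumptions made for the example.

```python
def multi_field_and(doc, terms, fields):
    """Return True if every term occurs in at least one of the listed fields.

    `doc` is a hypothetical mapping of field name -> field text; a real
    index would use an analyzer rather than a plain whitespace split.
    """
    def tokens(field):
        return set(doc.get(field, "").lower().split())

    return all(
        any(term.lower() in tokens(field) for field in fields)
        for term in terms
    )


doc = {"title": "Whale migration", "subject": "marine biology"}

# "whale" is in the title and "biology" is in the subject, so this matches.
print(multi_field_and(doc, ["whale", "biology"], ["title", "subject"]))

# "arctic" appears in neither field, so this does not match.
print(multi_field_and(doc, ["whale", "arctic"], ["title", "subject"]))
```

A single-field AND over "title" alone would reject the first query,
which is why a keyword search spread across several meta-data fields
benefits from the multi-field form.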

Changes in 1.8
==============

- The default dynaXML docFormatterCommon.xsl stylesheet was not
  properly computing the path for figure references if the doc
  source is external (e.g. specified by source=http://xxx in 
  the URL). [Bug 1542867]

- The sample "book bag" and "more like this" features were broken,
  due to missing script files in the distribution. Also, a few
  source files needed to rebuild the internal JavaCC parsers in
  XTF were missing. [Bug 1542866]

Changes in 1.8-beta 
===================

- Many users have requested EAD support in XTF out-of-the-box. 
  While XTF has always been capable of handling these, the 
  default stylesheets were very TEI-centric. This release contains 
  brand new stylesheets that support TEI, EAD, PDF, HTML, and Text.
  Flexible meta-data handling will use *.dc files if present. If
  not present, XTF will look inside TEI and EAD documents. Also, the
  confusing reliance on *.mets files has been completely removed.
  [Feature 1534843]

- Disabled non-standard whitespace stripping while building lazy
  tree files. Previously, XTF stripped whitespace between elements,
  which caused differing results from the same stylesheets run
  through Saxon from the command-line. If absolutely necessary,
  there is an undocumented index config flag to turn stripping back
  on: <whitespace strip="yes"/> [Bug 1534845]

- Upgraded PDFBox to most recent version (0.7.2) which offers
  greater speed and stability, and better results.

- Fixed FileUtils.exists() function called by some stylesheets to
  automatically handle a "file:" prefix if present. [Bug 1527960]

- Fixed PDF filter in indexer to automatically escape XML characters
  such as '<', and to strip out invalid characters. [Bug 1527958]

- Same fix for text files. [Bug 1523481]

- Certain unusual queries caused an assertion in FieldSpanSource:
  "kept span was cancelled". Fixed. [Bug 1523479]

- Fixed problem that kept JavaDocs from building. [Bug 1534856]

- XTF now avoids loading external DTDs for documents pulled in
  through the Saxon document() function. This helps speed the
  processing, and reduces external dependencies. [Feature 1487684]

- Fixed bug that caused indexer to crash if resulting index is
  empty (e.g. if no docs found). [Bug 1534860]

- Fixed bug: indexDump would only output first of multiple
  un-tokenized values for a field. [Bug 1534861]

- Experimental support for spelling correction has been added.
  Documentation to follow.

- Experimental new query operator added: <orNear>, which is like
  a standard OR query except that it will take proximity into
  account when multiple terms are present in one document.

- Improvements to the experimental "more like this" query. It
  may be getting close to prime-time.

- The XTF icon has been changed to be more descriptive, less
  confusing, and arguably less fun. "XTF Man" is gone.
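
Illustration of the PDF and text filter fixes listed above (escaping XML
characters such as '<' and stripping invalid characters). This is a
minimal Python sketch of the general technique, not the indexer's actual
code. The function name is hypothetical, and the set of "invalid"
characters is assumed here to be everything outside the XML 1.0 Char
production.

```python
import re
from xml.sax.saxutils import escape

# Matches anything that is NOT a legal XML 1.0 character: control
# characters other than tab, LF and CR, surrogates, and U+FFFE/U+FFFF.
_INVALID_XML = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_for_xml(text):
    """Drop characters XML cannot carry, then escape '&', '<' and '>'."""
    return escape(_INVALID_XML.sub("", text))


# The NUL and vertical-tab characters are removed; '<' and '&' are escaped.
print(clean_for_xml("a < b & c\x00\x0b"))
```

Stripping before escaping keeps the two steps independent: the escape
step never sees a character that would make the resulting XML
ill-formed.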


Changes in 1.7 
==============

- Change log now contains item numbers from the SourceForge trackers
  ("Feature Requests" or "Bugs") which can be referenced for more
  detailed information.

- Added new front end to crossQuery servlet. The new "query router"
  stylesheet allows the use of multiple query parsers. Those just
  starting out, or who only need one parser, can use the default
  queryRouter.xsl without change. [Feature Req 1470967]

- textIndexer now allows "deep" section type indexing. A new attribute
  "sectionTypeAdd" can be inserted by the prefilter stylesheet. This
  causes the text in that section to inherit its parent's sectionType
  and add the specified text. This allows simple processing of
  hierarchical sections without complex prefilter code. 
  [Feature Req 1491315]

- Many users have expressed confusion over the way document IDs were
  handled in dynaXML, and observed that much CDL-centric code is
  present in the default stylesheets. These have been refactored,
  and document IDs are now simply the path from the data directory
  to each document, instead of a strict 10-character code.
  [Feature Req 1499142]
  
- XTF now allows stylesheets to track data on a per-user-session
  basis. A simple API is provided to get and set state data. The
  session identifier is tracked using cookies, or if the user has
  cookies disabled, through URL rewriting. [Feature Req 1470973]

- Default stylesheets now expose "Book Bag" and "More Like This"
  functionality. The former is based on the session state API, the
  latter on the new <moreLikeThis> query operator. These also
  demonstrate an AJAX style of programming, updating pages on the fly.
  [Feature Req 1470975]

- New "exact" query operator added. To match, the field must contain
  exactly the query phrase; no more, no less. [Feature Req 1120263]

- Added new "moreLike" query operator which uses a simple index-based
  algorithm to locate additional documents that resemble a specified
  document. This feature is considered experimental and subject to
  improvement/change. [Feature Req 1470968]

- Made minor changes to the experimental "boost set" facility.

- Fixed bug in phrase query if stop-words appeared at start or end of
  a meta-data field. [Bug 1470978]

- Fixed bug where apostrophes and other combined words at the start
  or end of a meta-data field would cause queries to not match.
  [Bug 1437031]
  
- Fixed bug causing boost values to have no effect on an <or> query.
  [Bug 1471061]

- Refactored Lucene integration. The result is more modular, which
  will help in upgrading to Lucene 1.9 and 2.0. Back-ported selected
  classes to improve span processing on indexes with millions of
  records. [Feature Req 1470982]

- Config file parameters are now case insensitive. Also, boolean
  parameters all uniformly accept "true", "yes", "1" as synonyms, and
  "false", "no", and "0" as synonyms. [Feature Req 1471004]

- Added ability to display non-normalized (or raw) scores in
  crossQuery. [Feature Req 1471009]

- Added optional "score explanation" in crossQuery, to give a very
  detailed description of how each document's score was computed.
  [Feature Req 1471015]

- Made several changes and fixes to the experimental 'facet' feature.

- Multiple index prefilters may now be specified for one document by
  the docSelector stylesheet. The prefilters will be run in a chain.
  [Feature Req 1471018]

- Added support for parsing MARC21 data files. The indexer will break
  them into records, convert them to MARCXML format, and pass each
  converted record to the prefilter(s). Very large files are
  supported, and the indexer will try to skip bad records and recover.
  [Feature Req 1471020]

- Fixed null pointer exception in dynaXML when an empty query was
  specified. [Bug 1471022]

- Servlets now allow ";" to separate URL parameters. This can be quite
  handy as opposed to "&", since the latter requires special escaping
  in stylesheets. Both are now supported interchangeably. [Feature Req
  1471023]
  
- All references to "ngrams" have been changed to the more specific
  term "bigrams".

- Improved efficiency of span collection in the Text Engine.

- Vastly reduced memory usage of cached sorting arrays for indexes
  that contain only meta-data.

- All servlets now pass a "servlet.dir" parameter to stylesheets. This
  is the home directory of the XTF installation, and can be used by
  stylesheets to locate data files or for other purposes. [Feature Req
  1397346]

- crossQueryResult input to resultFormatter stylesheet now contains
  the original parsed URL parameters, and the query that resulted from
  the queryParser stylesheet. Both of these can be quite useful in
  result formatting. [Feature Req 1471062]

- Queries output from queryParser stylesheet may now optionally
  contain <resultData> elements. These are ignored by the Text Engine,
  but passed on to the result formatter stylesheet. They're a handy
  way for the query parser to pass data directly to the result
  formatter. [Feature Req 1471063]

- Meta-data fields can now be marked in index prefilter as
  xtf:store="no", which prevents them from showing up in query
  results. The field is still indexed, just not stored or displayed.
  [Feature Req 1471065]

- Similarly, the index prefilter can mark a field with xtf:index="no",
  causing it to not be indexed (and thus not searchable) but still be
  stored and displayed. [Feature Req 1471065]

- Improved efficiency of textIndexer's culling phase. In particular,
  it no longer runs out of memory and crashes on indexes with millions
  of documents. [Bug 1471067]

- 'indexStats' tool is now much faster, and attempts to provide as
  much information early in the process as possible. Also, doesn't
  crash on large indexes. [Feature Req 1291547]

- Added new 'indexDump' tool, which can dump selected meta-data fields
  from all documents in an index. [Feature Req 1471070]

- Fixed bug where indexer would occasionally crash when trying to
  create a lazy tree file without creating its directory first.
  [Bug 1471071]

- Fixed bug that caused XML namespace declarations to be dropped from
  the beginning of lazy tree files. [Bug 1397341]

- textIndexer now tracks and displays the elapsed time of each
  indexing run. [Feature Req 1471072]
  
- crossQuery wasn't paying attention to the MIME type specified by
  the Result Formatter stylesheet's output specification. Now the
  default (text/html) is only used if none is specified. [Bug 1499137]
  
- Fixed assertion failure when a <not> clause appeared within a
  <near> query. [Bug 1489230]
  
- Fixed a bug in the internal simplification of boolean queries that
  caused an assertion failure when searching for "the". [Bug 1482066]
  
- Fixed bug in dynaXML that gave an unenlightening error message if
  the source file specified by the docReqParser is actually a
  directory. [Bug 1499148]
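
Illustration of the config file change listed above (case-insensitive
parameters, with "true", "yes", "1" and "false", "no", "0" accepted as
synonyms). This is a minimal Python sketch of that behavior, not XTF's
parser. The function names and the sample parameter name are
hypothetical, and treating the boolean values themselves as
case-insensitive is an assumption.

```python
_TRUE = {"true", "yes", "1"}
_FALSE = {"false", "no", "0"}


def parse_bool(value):
    """Map any accepted synonym to a Python bool; reject anything else."""
    v = value.strip().lower()  # assumption: value case is ignored too
    if v in _TRUE:
        return True
    if v in _FALSE:
        return False
    raise ValueError("not a boolean value: %r" % value)


def get_param(params, name):
    """Look up a parameter without regard to the case of its name."""
    lowered = {key.lower(): val for key, val in params.items()}
    return lowered[name.lower()]


config = {"StripWhitespace": "Yes"}  # hypothetical parameter name
print(parse_bool(get_param(config, "stripwhitespace")))
```

Raising an error for unrecognized values, rather than silently treating
them as false, is a design choice of this sketch.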
  

Changes in 1.6.1 
================

- New "debug step" mode added, which can be very handy both to
  understand crossQuery and to debug stylesheet problems. This is
  enabled by adding "&debugStep=1" to the crossQuery URL. This also
  works in the experimental SRU servlet. [Feature Req 1292474]

- Added optional ability to turn on a "runaway" timer, that will
  report and optionally kill off requests that exceed specified time
  limits. This can help in tracking down intermittent server
  slowdowns. This is configured in conf/crossQuery.conf and
  conf/dynaXML.conf.

- Added optional cutoff size for latency reporting. After a request
  has exceeded this amount of data, the servlet will report the
  latency immediately. When the request finally finishes, the final
  latency is also reported. This is configured in conf/crossQuery.conf
  and conf/dynaXML.conf.

- Minor improvements to paging behavior in crossQuery
  resultFormatterCommon.xsl.

- Fixed "Modify Query" link in crossQuery default/resultFormatter.xsl.

- Fixed bug that caused the Content-Type of "raw" mode output from
  dynaXML to be "text/html" instead of the proper "text/xml".
  [Bug 1397342]

- Fixed a potential thread synchronization issue in lazy file access.

- Changed timestamp output to be more compatible with Resin and
  Tomcat.

- Fixed thread contention issue with query rewriting.

- Fixed memory leak when performing searches in dynaXML.

- Fixed handling of '&quot;' and '&amp;' characters in meta-data
  fields during indexing (was throwing an exception instead of passing
  these through.)

- Switched indexer to using Lucene's "compound files" mode. This
  results in indexes that have many fewer files, and thus avoids
  problems with running out of filesystem handles. The indexes are
  compatible, and the indexer will silently upgrade older indexes to
  the new compound files.

- TextIndexer now outputs the XTF version number (1.6.1) instead of
  the perpetual "1.0".

- Reduced memory usage of accent and plural mapping facilities.

- Clarified error message when an exception is encountered during
  Saxon processing.

- Minor documentation updates to reflect new features above, but a
  full documentation revision will have to wait until the 1.7 release.

Changes in 1.6.0 
================

- Fixed caching problem that caused sort/group data to be reloaded on
  each query, rather than cached between queries as was intended.
  [Bug 1285170]

- Fixed static variable problem that was causing the SRU and
  crossQuery servlets to conflict with each other.

- Fixed multi-threading bug: when many simultaneous crossQuery threads
  tried to access the same index, they would sometimes corrupt each
  others' span results.

- Fixed bug: for apps that use the QueryProcessor Java API, the hit
  count and score normalization were not being reset from one use to
  the next. This bug did not affect crossQuery or dynaXML, which make
  a new QueryProcessor for every request.

- Fixed to avoid marking terms specified in a <not> query.

- Fixed a bug causing the indexer to crash when tokenizing certain
  fields ending in "."

- Fixed 'textIndexer' and 'indexStats' scripts to work properly under
  Microsoft Windows.

- Fixed a bug in handling of '&', '<', and '>' in source documents:
  they were being double-escaped. For instance, '&' would become
  '&amp;amp;' instead of '&amp;'.

- Fixed a bug in handling of the XSLT 'preceding::*' axis. The axis
  would operate incorrectly on lazy trees, essentially acting just
  like 'preceding-sibling::*'.

- The SRU servlet was completely broken, but is now working again.

- Sample stylesheets now provide an option to reverse the order of
  sort-by-year.

- Added a new feature (as yet undocumented) that allows stylesheets to
  call out to external command-line tools. Robustly handles XML input
  and output, and allows a timeout specification. See
  regress/CrossQuery/K-External for examples of how to use this
  facility.

- Distribution now available as either a full distribution as before,
  or split into "core" and "example" pieces. The "core" piece is
  especially useful for existing users to upgrade the core while
  leaving all their stylesheets and configuration files intact.

- Minor documentation corrections.
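
Illustration of the double-escaping bug listed above for '&', '<', and
'>' in source documents. This is a minimal Python sketch of the general
failure mode (escaping text that has already been escaped), not the XTF
code path. The sample string is made up for the example.

```python
from xml.sax.saxutils import escape

raw = "AT&T <tag>"

once = escape(raw)    # correct: each special character escaped one time
twice = escape(once)  # the bug: the '&' of every entity is escaped again

print(once)   # AT&amp;T &lt;tag&gt;
print(twice)  # AT&amp;amp;T &amp;lt;tag&amp;gt;
```

The usual remedy is to escape exactly once, at the point where text is
serialized as XML, and to pass unescaped text everywhere else.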

Changes in 1.5.1 
================

- Now works again under Java 1.4 (was using Integer.valueOf(int),
  which is only present in Java 1.5)

- Corrected problem where dynaXML wouldn't run unless an index was
  present (tried to create lazy file in a directory that didn't exist
  yet.)

- Fixed textIndexer and indexStats scripts to allow spaces in the
  XTF_HOME path, and to properly switch between ":" classpath
  separation on Unix and ";" on Windows.

- crossQuery servlet now passes the time (in seconds) it took to parse
  and process the query to the resultFormatter stylesheet.
  Documentation reflects this. [Feature Req 1250702]

- Installation procedures corrected and simplified. Please see new
  documentation for more details.
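
Illustration of the script fix listed above (":" classpath separation on
Unix, ";" on Windows). This is a minimal Python sketch of choosing the
separator by platform. The actual XTF scripts are shell and batch files,
and the function name and jar paths here are hypothetical.

```python
import os


def build_classpath(entries, windows=None):
    """Join classpath entries with ';' on Windows and ':' elsewhere."""
    if windows is None:
        windows = os.name == "nt"  # detect the platform when not forced
    separator = ";" if windows else ":"
    return separator.join(entries)


# The first entry contains a space, as an XTF_HOME path might.
entries = ["/opt/my xtf/lib/example.jar", "/opt/my xtf/classes"]
print(build_classpath(entries, windows=False))
print(build_classpath(entries, windows=True))
```

When the result is passed on a command line, the whole classpath should
be quoted so that a space inside an entry is not treated as an argument
break.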