Changes in 1.9
==============
- Lots of documentation updates. See docs/index.html for details.
[Support #1548143, 1542336, 1524668, 1493134, 1487452,
1470971, 1430268, 1365117, 1298253, 1292469, 1289172,
1289092, 1289091, 1285432, 1285208, 1285201, 1285189,
1269967]
- Stylesheets can now cause a real HTTP redirect, to send the
user's browser to a different URL. [Feature 1568108]
- New query operator: multi-field AND, that requires *all* terms
to be present, but in *any* of the listed fields. Default
stylesheets now use this for basic "keyword" search.
[Feature 1607168]
- New query operator: orNear. This is like a typical OR query,
except that when multiple terms are present in the same meta-
data field, their proximity is taken into account when
scoring. [Feature 1607170]
- Added Saxon extension functions to check a file's existence,
and get its length or timestamp. [Features 1118097, 1306981]
- Fixed bug: Scoring of hits in the full text ("text" field)
was not normalizing based on the length of the document,
causing full-text documents to dominate meta-data-only docs in
keyword searches that include both. [Bug 1607181]
- Fixed bug: The original EAD stylesheets didn't show hits
in context. Also, the output was very different from the TEI
display. EAD formatting is now much better. [Bug 1548258]
- Fixed bug: Some links in the TEI navigation bar were broken.
[Bug 1607175]
- Fixed bug: wildcards containing accented characters were not
working properly. [Bug 1571681]
- Fixed bug: Diacritic characters in URLs weren't always being
properly decoded by the servlets. [Bug 1570248]
- Fixed bug: The XTF query parser was rejecting <sectionType>
queries within <near> queries. [Bug 1570262]
- Fixed bug: Pass-through tags had stopped working, with the
servlets rejecting them. [Bug 1554410]
- Fixed bug: "Exact" queries beginning or ending with stop
words failed to match. [Bug 1548289]
- Fixed bug: the year parsing in the default textIndexer
prefilters was failing to parse dates with repeated years
such as "December 1999, copyright (C) 1999". [Bug 1548260]
- Fixed bug: Synthetic queries (such as "more like this") were
using the Lucene "coordination factor" and thus getting
artificially low scores. The coord factor is only for user-
generated queries. [Bug 1607177]
- Fixed bug: Default crossQuery result formatter was jamming all
text snippets together on one line. [Bug 1607173]
- Fixed bug: XTF query parser wasn't recognizing the "maxSnippets"
attribute on the main <query> element, only on sub-elements.
[Bug 1591585]
- Fixed bug: phrase queries containing repeated terms failed to
match at all. [Bug 1523473]
- Added experimental dynamic FRBR mode. See
docs/experimental.html for details.
- To aid in searching fine-grained numeric data (such as
timestamps), a new mode for the <range> operator has been
added that allows for very efficient searching on numeric
data. [Feature 1541643]
- A couple of jar files have been removed, as they were no longer
necessary. [Bug 1585405]
- The web.xml file included should now be more compatible with
Jetty. [Bug 1585408]
- A batch file for running the textIndexer under Windows is now
included in the distribution. [Feature 1595965]
- Added support for TEI tables in
style/dynaXML/docFormatter/tei/table.xsl [Bug 1607153]
- Acknowledgement messages for the generous support from the
Andrew W. Mellon Foundation have been placed in the various
source code modules that were primarily impacted by the
grant-funded work. [Feature 1607184]
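
The multi-field AND operator listed above can be illustrated with a
small stand-alone sketch. This is not XTF's implementation: the class
name, the map-based document model, and the simple word splitting are
all assumptions made for illustration. It only shows the matching rule,
namely that every term must occur, but each term may occur in any of
the listed fields.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of the XTF code base.
final class MultiFieldAndSketch {

    /** Collects the lower-cased words found in any of the listed fields. */
    static Set<String> wordsIn(Map<String, String> doc, List<String> fields) {
        Set<String> words = new HashSet<String>();
        for (String field : fields) {
            String text = doc.get(field);
            if (text == null) continue; // field missing from this document
            // Assumed tokenization: split on anything that is not a letter or digit.
            for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!w.isEmpty()) words.add(w);
            }
        }
        return words;
    }

    /** True when every term appears in at least one of the listed fields. */
    static boolean matches(Map<String, String> doc, List<String> fields, List<String> terms) {
        Set<String> words = wordsIn(doc, fields);
        for (String term : terms) {
            if (!words.contains(term.toLowerCase(Locale.ROOT))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<String, String>();
        doc.put("title", "Whale Songs");
        doc.put("creator", "Herman Melville");
        List<String> terms = Arrays.asList("whale", "melville");
        // The two terms live in different fields, so both fields must be listed.
        System.out.println(matches(doc, Arrays.asList("title", "creator"), terms));
        System.out.println(matches(doc, Arrays.asList("title"), terms));
    }
}
```

A plain per-field AND would reject this sample document, since neither
field contains both terms on its own.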
Changes in 1.8
==============
- The default dynaXML docFormatterCommon.xsl stylesheet was not
properly computing the path for figure references if the doc
source is external (e.g. specified by source=http://xxx in
the URL). [Bug 1542867]
- The sample "book bag" and "more like this" features were broken,
due to missing script files in the distribution. Also, a few
source files needed to rebuild the internal JavaCC parsers in
XTF were missing. [Bug 1542866]
Changes in 1.8-beta
===================
- Many users have requested EAD support in XTF out-of-the-box.
While XTF has always been capable of handling EAD documents, the
default stylesheets were very TEI-centric. This release contains
brand new stylesheets that support TEI, EAD, PDF, HTML, and Text.
Flexible meta-data handling uses *.dc files if present; otherwise,
it looks inside the TEI and EAD documents. Also, the
confusing reliance on *.mets files has been completely removed.
[Feature 1534843]
- Disabled non-standard whitespace stripping while building lazy
tree files. Previously, XTF stripped whitespace between elements,
which caused differing results from the same stylesheets run
through Saxon from the command-line. If absolutely necessary,
there is an undocumented index config flag to turn stripping back
on: <whitespace strip="yes"/> [Bug 1534845]
- Upgraded PDFBox to most recent version (0.7.2) which offers
greater speed and stability, and better results.
- Fixed FileUtils.exists() function called by some stylesheets to
automatically handle a "file:" prefix if present. [Bug 1527960]
- Fixed PDF filter in indexer to automatically escape XML characters
such as '<', and to strip out invalid characters. [Bug 1527958]
- Same fix for text files. [Bug 1523481]
- Certain unusual queries caused an assertion in FieldSpanSource:
"kept span was cancelled". Fixed. [Bug 1523479]
- Fixed problem that kept JavaDocs from building. [Bug 1534856]
- XTF now avoids loading external DTDs for documents pulled in
through the Saxon document() function. This helps speed the
processing, and reduces external dependencies. [Feature 1487684]
- Fixed bug that caused indexer to crash if resulting index is
empty (e.g. if no docs found). [Bug 1534860]
- Fixed bug: indexDump would only output first of multiple
un-tokenized values for a field. [Bug 1534861]
- Experimental support for spelling correction has been added.
Documentation to follow.
- Experimental new query operator added: <orNear>, which is like
a standard OR query except that it will take proximity into
account when multiple terms are present in one document.
- Improvements to the experimental "more like this" query. It
may be getting close to prime-time.
- The XTF icon has been changed to be more descriptive, less
confusing, and arguably less fun. "XTF Man" is gone.
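
The PDF and text filter fixes above (escaping XML characters and
stripping invalid ones) can be sketched roughly as follows. This is an
assumed, simplified version rather than the indexer's actual code; the
class and method names are hypothetical. The validity test follows the
Char production of the XML 1.0 specification.

```java
// Hypothetical illustration only; not part of the XTF code base.
final class XmlTextCleanerSketch {

    /** True if the code point is allowed in an XML 1.0 document. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops invalid characters, then escapes '<', '>' and '&'. */
    static String clean(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i);
            i += Character.charCount(cp);
            if (!isValidXmlChar(cp)) continue; // strip control chars, lone surrogates
            switch (cp) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The trailing U+0001 control character is not legal in XML and is dropped.
        System.out.println(clean("a < b & c\u0001"));
    }
}
```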
Changes in 1.7
==============
- Change log now contains item numbers from the SourceForge trackers
("Feature Requests" or "Bugs") which can be referenced for more
detailed information.
- Added new front end to crossQuery servlet. The new "query router"
stylesheet allows the use of multiple query parsers. Those just
starting out, or who only need one parser, can use the default
queryRouter.xsl without change. [Feature Req 1470967]
- textIndexer now allows "deep" section type indexing. A new attribute
"sectionTypeAdd" can be inserted by the prefilter stylesheet. This
causes the text in that section to inherit its parent's sectionType
and add the specified text. This allows simple processing of
hierarchical sections without complex prefilter code.
[Feature Req 1491315]
- Many users have expressed confusion over the way document IDs were
handled in dynaXML, and observed that much CDL-centric code is
present in the default stylesheets. These have been refactored,
and document IDs are now simply the path from the data directory
to each document, instead of a strict 10-character code.
[Feature Req 1499142]
- XTF now allows stylesheets to track data on a per-user-session
basis. A simple API is provided to get and set state data. The
session identifier is tracked using cookies, or if the user has
cookies disabled, through URL rewriting. [Feature Req 1470973]
- Default stylesheets now expose "Book Bag" and "More Like This"
functionality. The former is based on the session state API, the
latter on the new <moreLikeThis> query operator. These also
demonstrate an AJAX style of programming, updating pages on the fly.
[Feature Req 1470975]
- New "exact" query operator added. To match, the field must contain
exactly the query phrase; no more, no less. [Feature Req 1120263]
- Added new "moreLike" query operator which uses a simple index-based
algorithm to locate additional documents that resemble a specified
document. This feature is considered experimental and subject to
improvement/change. [Feature Req 1470968]
- Made minor changes to the experimental "boost set" facility.
- Fixed bug in phrase query if stop-words appeared at start or end of
a meta-data field. [Bug 1470978]
- Fixed bug where apostrophes and other combined words at start or
end of a meta-data field would cause queries to not match. [Bug
1437031]
- Fixed bug causing boost values to have no effect on an <or> query.
[Bug 1471061]
- Refactored Lucene integration. The result is more modular, which
will help in upgrading to Lucene 1.9 and 2.0. Back-ported selected
classes to improve span processing on indexes with millions of
records. [Feature Req 1470982]
- Config file parameters are now case insensitive. Also, boolean
parameters all uniformly accept "true", "yes", "1" as synonyms, and
"false", "no", and "0" as synonyms. [Feature Req 1471004]
- Added ability to display non-normalized (or raw) scores in
crossQuery. [Feature Req 1471009]
- Added optional "score explanation" in crossQuery, to give a very
detailed description of how each document's score was computed.
[Feature Req 1471015]
- Made several changes and fixes to the experimental 'facet' feature.
- Multiple index prefilters may now be specified for one document by
the docSelector stylesheet. The prefilters will be run in a chain.
[Feature Req 1471018]
- Added support for parsing MARC21 data files. The indexer will break
them into records, convert them to MARCXML format, and pass each
converted record to the prefilter(s). Very large files are
supported, and the indexer will try to skip bad records and recover.
[Feature Req 1471020]
- Fixed null pointer exception in dynaXML when an empty query was
specified. [Bug 1471022]
- Servlets now allow ";" to separate URL parameters. This can be quite
handy as opposed to "&", since the latter requires special escaping
in stylesheets. Both are now supported interchangeably. [Feature Req
1471023]
- All references to "ngrams" have been changed to the more specific
term "bigrams".
- Improved efficiency of span collection in the Text Engine.
- Vastly reduced memory usage of cached sorting arrays for indexes
that contain only meta-data.
- All servlets now pass a "servlet.dir" parameter to stylesheets. This
is the home directory of the XTF installation, and can be used by
stylesheets to locate data files or for other purposes. [Feature Req
1397346]
- crossQueryResult input to resultFormatter stylesheet now contains
the original parsed URL parameters, and the query that resulted from
the queryParser stylesheet. Both of these can be quite useful in
result formatting. [Feature Req 1471062]
- Queries output from queryParser stylesheet may now optionally
contain <resultData> elements. These are ignored by the Text Engine,
but passed on to the result formatter stylesheet. They're a handy
way for the query parser to pass data directly to the result
formatter. [Feature Req 1471063]
- Meta-data fields can now be marked in index prefilter as
xtf:store="no", which prevents them from showing up in query
results. The field is still indexed, just not stored or displayed.
[Feature Req 1471065]
- Similarly, the index prefilter can mark a field with xtf:index="no",
causing it to not be indexed (and thus not searchable) but still be
stored and displayed. [Feature Req 1471065]
- Improved efficiency of textIndexer's culling phase. In particular,
it no longer runs out of memory and crashes on indexes with millions
of documents. [Bug 1471067]
- 'indexStats' tool is now much faster, and attempts to provide as
much information early in the process as possible. It also no longer
crashes on large indexes. [Feature Req 1291547]
- Added new 'indexDump' tool, which can dump selected meta-data fields
from all documents in an index. [Feature Req 1471070]
- Fixed bug where indexer would occasionally crash when trying to
create a lazy tree file without creating its directory first.
[Bug 1471071]
- Fixed bug that caused XML namespace declarations to be dropped from
the beginning of lazy tree files. [Bug 1397341]
- textIndexer now tracks and displays the elapsed time of each
indexing run. [Feature Req 1471072]
- crossQuery wasn't paying attention to the MIME type specified by
Result Formatter stylesheet output specification. Now the default
(text/html) is only used if none is specified. [Bug 1499137]
- Fixed assertion failure when a <not> clause appeared within a
<near> query. [Bug 1489230]
- Fixed a bug in the internal simplification of boolean queries that
caused an assertion failure when searching for "the". [Bug 1482066]
- Fixed bug in dynaXML that gave an unenlightening error message if
the source file specified by the docReqParser is actually a
directory. [Bug 1499148]
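
The ";" parameter separator described above can be illustrated with a
short sketch. It is not the servlets' actual parsing code; the class
name and the choice to let a later duplicate overwrite an earlier one
are assumptions. It shows ";" and "&" being treated interchangeably,
which lets stylesheets build URLs without escaping "&".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the XTF code base.
final class UrlParamSplitSketch {

    /** Splits a query string on either ';' or '&' into name/value pairs. */
    static Map<String, String> parse(String query) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        for (String pair : query.split("[;&]")) {
            if (pair.isEmpty()) continue; // tolerate doubled separators
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // Assumption: a repeated name simply replaces the earlier value.
            params.put(decode(name), decode(value));
        }
        return params;
    }

    /** Percent-decodes one component as UTF-8. */
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Mixed separators; the parameter names here are made up for the example.
        System.out.println(parse("text=whale;sort=year&mode=raw"));
    }
}
```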
Changes in 1.6.1
================
- New "debug step" mode added, which can be very handy both to
understand crossQuery and to debug stylesheet problems. This is
enabled by adding "&debugStep=1" to the crossQuery URL. This also
works in the experimental SRU servlet. [Feature Req 1292474]
- Added optional ability to turn on a "runaway" timer, that will
report and optionally kill off requests that exceed specified time
limits. This can help in tracking down intermittent server
slowdowns. This is configured in conf/crossQuery.conf and
conf/dynaXML.conf.
- Added optional cutoff size for latency reporting. After a request
has exceeded this amount of data, the servlet will report the
latency immediately. When the request finally finishes, the final
latency is also reported. This is configured in conf/crossQuery.conf
and conf/dynaXML.conf.
- Minor improvements to paging behavior in crossQuery
resultFormatterCommon.xsl.
- Fixed "Modify Query" link in crossQuery default/resultFormatter.xsl.
- Fixed bug that caused the Content-Type of "raw" mode output from
dynaXML to be "text/html" instead of the proper "text/xml".
[Bug 1397342]
- Fixed a potential thread synchronization issue in lazy file access.
- Changed timestamp output to be more compatible with Resin and
Tomcat.
- Fixed thread contention issue with query rewriting.
- Fixed memory leak with performing searches in dynaXML.
- Fixed handling of '"' and '&' characters in meta-data
fields during indexing (was throwing an exception instead of passing
these through.)
- Switched indexer to using Lucene's "compound files" mode. This
results in indexes that have many fewer files, and thus avoids
problems with running out of filesystem handles. The indexes are
compatible, and the indexer will silently upgrade older indexes to
the new compound files.
- TextIndexer now outputs the XTF version number (1.6.1) instead of
the perpetual "1.0".
- Reduced memory usage of accent and plural mapping facilities.
- Clarified error message when an exception is encountered during
Saxon processing.
- Minor documentation updates to reflect new features above, but a
full documentation revision will have to wait until the 1.7 release.
Changes in 1.6.0
================
- Fixed caching problem that caused sort/group data to be reloaded on
each query, rather than cached between queries as was intended.
[Bug 1285170]
- Fixed static variable problem that was causing the SRU and
crossQuery servlets to conflict with each other.
- Fixed multi-threading bug: when many simultaneous crossQuery threads
tried to access the same index, they would sometimes corrupt each
others' span results.
- Fixed bug: for apps that use the QueryProcessor Java API, the hit
count and score normalization were not being reset from one use to
the next. This bug did not affect crossQuery or dynaXML, which make
a new QueryProcessor for every request.
- Fixed hit marking to avoid marking terms specified in a <not> query.
- Fixed a bug causing the indexer to crash when tokenizing certain
fields ending in "."
- Fixed 'textIndexer' and 'indexStats' scripts to work properly under
Microsoft Windows.
- Fixed a bug in handling of '&', '<', and '>' in source documents:
they were being double-escaped. For instance, '&' would become
'&amp;amp;' instead of '&amp;'.
- Fixed a bug in handling of the XSLT 'preceding::*' axis. The axis
would operate incorrectly on lazy trees, essentially acting just
like 'preceding-sibling::*'.
- The SRU servlet was completely broken, but is now working again.
- Sample stylesheets now provide an option to reverse the order of
sort-by-year.
- Added a new feature (as yet undocumented) that allows stylesheets to
call out to external command-line tools. Robustly handles XML input
and output, and allows a timeout specification. See
regress/CrossQuery/K-External for examples of how to use this
facility.
- Distribution now available as either a full distribution as before,
or split into "core" and "example" pieces. The "core" piece is
especially useful for existing users to upgrade the core while
leaving all their stylesheets and configuration files intact.
- Minor documentation corrections.
Changes in 1.5.1
================
- Now works again under Java 1.4 (was using Integer.valueOf(int),
which is only present in Java 1.5)
- Corrected problem where dynaXML wouldn't run unless an index was
present (tried to create lazy file in a directory that didn't exist
yet.)
- Fixed textIndexer and indexStats scripts to allow spaces in the
XTF_HOME path, and to properly switch between ":" classpath
separation on Unix and ";" on Windows.
- crossQuery servlet now passes the time (in seconds) it took to parse
and process the query to the resultFormatter stylesheet.
Documentation reflects this. [Feature Req 1250702]
- Installation procedures corrected and simplified. Please see new
documentation for more details.
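
The classpath separator difference mentioned above (":" on Unix, ";" on
Windows) was fixed in the launcher scripts themselves. The Java sketch
below is only an illustration of the same platform difference, using
the JDK's File.pathSeparator constant; the class name and jar paths are
hypothetical.

```java
import java.io.File;

// Hypothetical illustration only; not part of the XTF code base.
final class ClasspathJoinSketch {

    /** Joins classpath entries with an explicit separator. */
    static String join(String[] entries, String separator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < entries.length; i++) {
            if (i > 0) sb.append(separator);
            sb.append(entries[i]);
        }
        return sb.toString();
    }

    /** Joins entries with the separator of the platform the JVM runs on. */
    static String joinForThisPlatform(String[] entries) {
        return join(entries, File.pathSeparator);
    }

    public static void main(String[] args) {
        // Made-up entries; the second contains a space, as an XTF_HOME path might.
        String[] entries = { "lib/a.jar", "my xtf/lib/b.jar" };
        System.out.println(joinForThisPlatform(entries));
    }
}
```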