Darryl Jenkins
New Member Posts:16
|
10/03/2012 6:03 PM |
|
I'm having problems getting accurate PDF content search results using the DMX v6.03 Lucene Search Provider. I have the latest IFilters installed and my site is running in Full Trust. Examining the index with Luke, I see some problems. For example, committe is indexed but committee is not; pilot is indexed but pilots is not. There are many cases where words ending in s are not indexed. A standalone third-party DNN search engine (Search Boost) has committee (not committe) and both pilots and pilot in its index, and overall it returns more accurate results. Is there anything I can do to improve the accuracy of the results returned by DMX? |
|
|
|
|
Peter Donker
Veteran Member Posts:4536
|
10/04/2012 3:01 PM |
|
We use the default Lucene search engine without any modifications. It is version 2.9.2.2. I'm not aware of word-ending issues like the ones you mention. The actual word storage is handled by Lucene, not by DMX, so the problem would have to show up in other applications using Lucene 2.9 as well. What version of Lucene does Search Boost use? |
|
|
|
|
Darryl Jenkins
New Member Posts:16
|
10/04/2012 3:12 PM |
|
Peter, Thanks for the response. When I examined the DMX index using Luke, it showed Lucene version 3.1. The Search Boost version is 2.9. I also examined an old DMX index built using version 5.x and it appears correct (I never had a problem with the search until this new version). It appears that the search engine is dropping the endings of words ending in s (e.g. pilots) and es or e (committe, includ, provid). The search functionality is extremely important to us, so I appreciate your help. |
|
|
|
|
Peter Donker
Veteran Member Posts:4536
|
10/04/2012 4:16 PM |
|
Hi Darryl, OK. That is somewhat confusing. Here Luke tells me the DMX index is 2.9, which would make sense since that is the version DMX came with. Can you verify the version number on the Bring2mind.Lucene.Net.dll? Peter |
|
|
|
|
Darryl Jenkins
New Member Posts:16
|
10/04/2012 4:41 PM |
|
Peter, I updated to 6.04 and re-ran the index. Luke now shows version 2.9 (also shown in the file properties), but the index hasn't really changed (i.e., it is still dropping s, es, etc.). Here's a view of the top-ranking terms in the DMX index:

1400  contents  the
1392  contents  to
1390  contents  in
1387  contents  a
1383  contents  and
1374  contents  of
1358  contents  for
1354  contents  on
1330  contents  is
1319  contents  will
1309  contents  pilot
1297  contents  with
1290  contents  that
1287  contents  be
1260  contents  at
1252  contents  this
1247  contents  by
1239  contents  are
1238  contents  delta
1234  contents  as
1209  contents  s
1198  contents  or
1162  contents  an
1160  contents  from
1117  contents  not
1115  contents  mec
1114  contents  have
1097  contents  all
1074  contents  if
1063  contents  provid
1043  contents  has
1034  contents  committe
1029  contents  it
1025  contents  may
989   contents  alpa
989   contents  time
973   contents  you
956   contents  one
936   contents  ani
936   contents  follow
934   contents  your
918   contents  1
881   contents  can
877   contents  two
875   contents  other
875   contents  includ
862   contents  been
836   contents  line
835   contents  which
831   contents  inform

Notice pilot (and not pilots), includ, provid, and committe.

Here's the printout from the Search Boost index (also very similar to the DMX 5.x index):

1493  Content  file
1493  Content  0
1279  Content  delta
1271  Content  s
1243  Content  pilots
1162  Content  pilot
1128  Content  mec
1115  Content  have
1110  Content  alpa
1097  Content  all
1057  Content  1
1047  Content  has
1025  Content  committee
1024  Content  may
991   Content  you
959   Content  one
951   Content  your
938   Content  any
886   Content  can
879   Content  time
877   Content  two
861   Content  been
857   Content  other
848   Content  2
834   Content  which
800   Content  following
798   Content  new
795   Content  more
793   Content  also
792   Content  provide
788   Content  during
787   Content  3
779   Content  available
771   Content  under
757   Content  first
746   Content  who
746   Content  information
745   Content  only
741   Content  after
741   Content  10
739   Content  11
732   Content  line
728   Content  12
723   Content  please
721   Content  5
718   Content  through
716   Content  we
716   Content  number
714   Content  than
699   Content  7

It includes both pilot and pilots as well as committee, etc. Darryl |
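The comparison of the two term lists above can also be done mechanically. The following is a minimal sketch, assuming each dump is a whitespace-separated run of count/field/term triples as pasted in this post; the function names are made up for illustration and the sample strings are a small subset of the two lists. The prefix test is only a rough hint that a term was stemmed.

```python
# Sketch only: compares two flattened "count field term" dumps like the ones
# pasted in this post. The sample strings are a small subset of those lists.


def parse_terms(dump):
    """Turn 'count field term count field term ...' into {term: count}."""
    tokens = dump.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected whitespace-separated count/field/term triples")
    return {tokens[i + 2]: int(tokens[i]) for i in range(0, len(tokens), 3)}


def missing_terms(reference, suspect, min_stem_len=3):
    """For each reference term absent from the suspect index, list suspect
    terms that are a prefix of it (a rough hint that it was stemmed).
    The prefix test is a heuristic: it will not catch rewrites such as
    any -> ani."""
    report = {}
    for term in reference:
        if term in suspect:
            continue
        report[term] = sorted(
            s for s in suspect if len(s) >= min_stem_len and term.startswith(s)
        )
    return report


dmx = parse_terms(
    "1309 contents pilot 1063 contents provid 1034 contents committe "
    "875 contents includ 831 contents inform"
)
search_boost = parse_terms(
    "1243 Content pilots 1162 Content pilot 1025 Content committee "
    "792 Content provide 746 Content information"
)

report = missing_terms(search_boost, dmx)
for term, stems in report.items():
    print(f"{term!r} not indexed verbatim; possible stems: {stems}")
```

Running this over the full dumps would list every Search Boost term that has no exact counterpart in the DMX index, together with the shortened forms that probably replaced it.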
|
|
|
|
Darryl Jenkins
New Member Posts:16
|
10/04/2012 8:48 PM |
|
Peter, While the index is dropping the s and es from words, I found that the Lucene.analysis.en.EnglishAnalyzer in Luke will return documents when I search for committees though the other analyzers will not. Don't know if this is helpful but I thought I would let you know. Darryl |
|
|
|
|
Peter Donker
Veteran Member Posts:4536
|
10/05/2012 4:54 PM |
|
Hi Darryl, OK, I think we're getting somewhere. So it's not the version of Lucene, but the algorithm used when it parses the incoming text. I know that I pass the text in there and have to tell it what language it should expect. This is done based on Threading.Thread.CurrentThread.CurrentCulture, i.e. the language that DNN is currently running under. It then uses the SnowballAnalyzer of Lucene to parse the text. I then stumbled on this: http://stackoverflow.com/...analyzer-vs-snowball which looks very much like what is happening. I'm not sure about the best way forward, but for now it looks like it will need a review. If you have the partial source version you could potentially already tweak this yourself to your own liking. For the main release I wonder if this should be revised and how. I.e. should there be another mechanism when retrieving the search results, or is it really the stemmer that is too aggressive? Can you give me an example of a search that is going wrong as a result of this? Peter |
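As a rough illustration of the culture-based selection described above, here is a minimal sketch. It is not the DMX code (which is .NET and not shown in this thread); the lookup table and function name are hypothetical, and the language names are only assumed to match what a Snowball stemmer expects.

```python
from typing import Optional

# Hypothetical lookup table: culture prefix -> stemmer language name.
# The actual list of languages DMX supports is not given in this thread.
STEMMER_LANGUAGES = {
    "en": "English",
    "de": "German",
    "fr": "French",
    "nl": "Dutch",
}


def stemmer_language(culture_name: str) -> Optional[str]:
    """Pick a stemmer language from a culture name such as 'en-US'.

    Returns None when no stemmer is known for the culture, in which case a
    caller would presumably fall back to a non-stemming analyzer.
    """
    prefix = culture_name.split("-", 1)[0].lower()
    return STEMMER_LANGUAGES.get(prefix)


print(stemmer_language("en-US"))
print(stemmer_language("nl-NL"))
```

The point of the sketch is that the index contents depend on the culture the site runs under at indexing time, so the same documents can produce different terms on different installations.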
|
|
|
|
Darryl Jenkins
New Member Posts:16
|
10/05/2012 5:14 PM |
|
Peter, Virtually any search for a word ending in s or es returns nothing at all. A search for a proper name (Roberts) returns 0 results, while a search for Robert returns all results for both Robert and Roberts. Searching for "Committees" returns nothing, but a search for "Committe" returns the results I expect. This seems to be the opposite of what the Snowball link above describes, as the thread states: "For example, Snowball will stem 'organization' into 'organ', so a search for 'organization' will return results with 'organ', without any scoring penalty." It may be stemming committees into 'committe', but a search for 'committees' returns nothing. Thanks again for your attention to this matter. Darryl |
|
|
|
|
Peter Donker
Veteran Member Posts:4536
|
10/05/2012 5:29 PM |
|
From what I've read, the stemmer causes the original word to be lost and replaced with the "stem" of the word. So Roberts becomes Robert. What I don't get is how, when retrieving search results, Roberts could find Robert in the data (which is what the link appears to suggest). There is virtually no documentation on this, so it's pretty tough to get to the bottom of it. But I'll give it a shot. Note that I'm out of the office for a couple of weeks, though. Peter |
|
|
|
|
Darryl Jenkins
New Member Posts:16
|
10/05/2012 5:38 PM |
|
Peter, Is the analyzer part of Lucene or part of DMX? Would loading the DMX 5.x version of Lucene (2.0) get me the results I'm currently getting with 5.x? Anything to get me up and running in the short term while you look into a solution would help. Thanks, Darryl |
|
|
|
|
Peter Donker
Veteran Member Posts:4536
|
10/05/2012 5:38 PM |
|
NB. During the search retrieval there is no definition of a stemmer. I forgot to add that. So "Roberts" goes in "as is" without being reduced to "Robert". |
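The mismatch described here (stemming when indexing, none when searching) can be reproduced in a few lines. This is a minimal sketch, not the DMX or Lucene code: `toy_stem` is a crude suffix stripper standing in for the real Snowball stemmer, and the two sample documents are invented. The second search mode shows one possible remedy, running the query through the same analysis as the index; the thread does not say how the issue was actually fixed.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper standing in for a real Snowball stemmer
    (an assumption for illustration, not the actual algorithm)."""
    word = word.lower()
    for suffix in ("es", "s", "e"):
        if len(word) > 4 and word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Index every word under its stem, so the original form is lost."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index


def search(index, query, analyze_query):
    """Look up one term; optionally stem the query like the indexed text."""
    term = toy_stem(query) if analyze_query else query.lower()
    return sorted(index.get(term, set()))


# Invented sample documents.
docs = {
    1: "Roberts chairs the committee",
    2: "Robert briefed the pilots",
}
index = build_index(docs)

# Query goes in "as is": the unstemmed term is simply not in the index.
print(search(index, "Roberts", analyze_query=False))
print(search(index, "Committe", analyze_query=False))

# Query analyzed the same way as the indexed text.
print(search(index, "Roberts", analyze_query=True))
print(search(index, "Committees", analyze_query=True))
```

With the query left as is, "Roberts" and "Committees" find nothing while "Committe" does, which matches the behaviour reported earlier in the thread.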
|
|
|
|
Peter Donker
Veteran Member Posts:4536
|
10/05/2012 5:57 PM |
|
The analyzer is wrapped into Lucene. But the code that tells Lucene what analyzer to use is in DMX and part of the partial source package. Switching DLLs won't work, I'm afraid. You'll probably end up with a mess, as the versions are all registered in the other DLLs and .NET will have a fit if you switch them. |
|
|
|
|
Darryl Jenkins
New Member Posts:16
|
10/09/2012 9:33 PM |
|
Peter, In the meantime I'm trying to go back to DMX 5.x, but after uninstalling DMX 6.x and installing 5.x, I get the following SQL error:

Failure SQL Execution resulted in following Exceptions: System.Data.SqlClient.SqlException (0x80131904): There is already an object named 'PK_DMX_EntryPermissions' in the database. Could not create constraint. See previous errors.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning()
at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlCommand.RunExecuteNonQueryTds(String methodName, Boolean async)
at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
at DotNetNuke.Data.SqlDataProvider.ExecuteADOScript(String SQL)
at DotNetNuke.Data.SqlDataProvider.ExecuteScript(String Script, Boolean UseTransactions)
ALTER TABLE dbo.[DMX_EntryPermissions] ADD CONSTRAINT [PK_DMX_EntryPermissions] PRIMARY KEY CLUSTERED ([EntryId], [PermissionId], [RoleId], [UserId])

Now I'm completely stuck, as neither edition will install (or uninstall). I've manually removed all the DMX tables, stored procedures, views, and functions in the database as well as the DesktopModules/DMX folder, but I continue to get this error. Hope you can help. Darryl |
|
|
|
|
Darryl Jenkins
New Member Posts:16
|
10/12/2012 2:56 PM |
|
Peter, I've got version 5.3.9 loaded on my site, so I'll wait to upgrade until after you've had a chance to look into the Lucene search issue. Can you add an activation to my account so I can activate the 5.x module? Thanks for all of your help. Darryl |
|
|
|
|
Peter Donker
Veteran Member Posts:4536
|
10/24/2012 5:55 PM |
|
Darryl, No problem. Please contact me by email for that, and include the invoice code. Peter |
|
|
|
|
david@designmind.com
New Member Posts:12
|
12/11/2012 8:14 PM |
|
We seem to be experiencing this problem as well. According to our DLLs we are using version 6.0.3 (Bring2mind.Lucene.Net.dll is v2.9.2.2). Is this resolved in the latest version? |
|
|
|
|
david@designmind.com
New Member Posts:12
|
12/12/2012 7:08 PM |
|
Apologies. I missed the BUGNET box at the top of the thread. The answer to my own question is: DMX - 476: Lucene stemmer issues. Fixed In Version: 06.01.00 |
|
|
|
|