From 07a3e29f397b4a8bd5ca124a1d020e050d20e131 Mon Sep 17 00:00:00 2001
From: Andreas Baumann
Date: Wed, 12 Apr 2017 20:50:46 +0200
Subject: some fixes and published strus web search article

---
 content/blog/web-search-homepage.md               |  39 +++++++++++----------
 static/images/blog/web-search-homepage/search.png | Bin 0 -> 83416 bytes
 static/images/blog/web-search-homepage/strus.jpg  | Bin 36406 -> 0 bytes
 3 files changed, 19 insertions(+), 20 deletions(-)
 create mode 100644 static/images/blog/web-search-homepage/search.png
 delete mode 100644 static/images/blog/web-search-homepage/strus.jpg

diff --git a/content/blog/web-search-homepage.md b/content/blog/web-search-homepage.md
index e7083bb..3a24ce5 100644
--- a/content/blog/web-search-homepage.md
+++ b/content/blog/web-search-homepage.md
@@ -1,14 +1,13 @@
 +++
-draft = true
 title = "Web search for my homepage"
 date = "2017-04-12T15:49:11+01:00"
-categories = [ "Strus", "Search", "Information Retrieval" ]
-thumbnail = "/images/blog/web-search-homepage/strus.jpg"
+categories = [ "Strus", "Search", "Information Retrieval", "Web" ]
+thumbnail = "/images/blog/web-search-homepage/search.png"
 +++
 ## Intro
 
 I wanted to add a search function to my web page.
-As the website is build with Hugo as a set of static
+As the website is built with Hugo as a set of static
 HTML pages onto a read-only web server, standard approaches
 didn't work like a LIKE-query in Mysql as many CMS are
 implementing search.
@@ -19,16 +18,16 @@ project.
 
 The basic idea is that the author of the web pages
 can build a search index locally with the markdown version
-of his content and then push it to a webservice dedicated
+of his content and then push it to a web service dedicated
 to search only. Again, the files making up the search index
 can be set to read-only after an update, leaving the system
 open to only DOSA or DDOSA (but which public system isn't).
 
 ## Installing strus for content indexing
 
-So, I installed the packages 'strusutilties' for ArchLinux
+So, I installed the 'strusutilities' package for ArchLinux
 on my local machine from the
-[OpenBuildService](https://software.opensuse.org/download.html?project=home:andreas_baumann&package=strusutilities)
+[Open Build Service](https://software.opensuse.org/download.html?project=home:andreas_baumann&package=strusutilities)
 with:
 
 ```
@@ -43,8 +42,9 @@ The command line tools consist of tools to analyze the document,
 apply some basic parsing and normalization of search terms.
 The tools take XML, JSON or TSV (tab-separated-values) currently.
 
-My Hugo documents have their metadata in TOML and the content in
-Markdown:
+My Hugo documents have their metadata in
+[TOML](https://en.wikipedia.org/wiki/TOML) and the content in
+[Markdown](https://de.wikipedia.org/wiki/Markdown):
 
 ```
 +++
@@ -55,7 +55,7 @@ thumbnail = "/images/blog/web-search-homepage/strus.jpg"
 +++
 
 I wanted to add a search function to my web page.
-As the website is build with Hugo as a set of static
+As the website is built with Hugo as a set of static
 ...
 ```
 
@@ -67,9 +67,9 @@ file using:
 
 * [pandoc](http://pandoc.org/): convert markdown to tons of formats
 
-I choose to convert to a DocBook style of XML and put all
-the posts into one big file called `posts.xml`. The metadata is
-embedded as a JSON value into the XML file in a tag ``.
+I choose to convert to a [DocBook](http://docbook.org/whatis) style
+of XML and put all the posts into one big file called `posts.xml`.
+The metadata is embedded as a JSON value into the XML file in a tag ``.
 
 The final XML file looks like:
 
@@ -90,7 +90,7 @@ The final XML file looks like:
 I wanted to add a search function to my
- web page. As the website is build with
+ web page. As the website is built with
 Hugo as a set of static
 ...
 ```
@@ -108,10 +108,10 @@ I packaged this whole ugly conversion step into a script like that:
 
 ## Configuring the document analysis and indexing process
 
-Now we define the configuration for the text analysis. Basically
+Now we define the configuration for the text analysis. Basically,
 we tell the system where to split the document into retrievable
 items, which features we want to be able to search for and what
-attributes and text we want to show in the ranklist.
+attributes and text we want to show in the rank list.
 
 The file `document.ana` contains a configuration which describes
 how Strus should analyze and index the documents:
@@ -170,7 +170,7 @@ when presenting the hit in the ranlist.
 The forward index stores the document almost verbatim as a sequence
 of title and text tokens. So when we get a hit in a search result
 we can present a selection of them (usually a sentence containing
-the matches) in the ranklist.
+the matches) in the rank list.
 
 Finally, we need to count the number of words per document, this
 is needed or the retrieval function:
@@ -192,9 +192,8 @@ can copy to the server running the `strusWebService`.
 
 ## Installing the strusWebService for querying
 
-On a publicly available server I installed the 'strusWebService':
-
-[OpenBuildService](https://software.opensuse.org/download.html?project=home:andreas_baumann&package=struswebservice)
+On a publicly available server I installed the 'strusWebService' package from the
+[Open Build Service](https://software.opensuse.org/download.html?project=home:andreas_baumann&package=struswebservice)
 with:
 
 ```
diff --git a/static/images/blog/web-search-homepage/search.png b/static/images/blog/web-search-homepage/search.png
new file mode 100644
index 0000000..77db107
Binary files /dev/null and b/static/images/blog/web-search-homepage/search.png differ
diff --git a/static/images/blog/web-search-homepage/strus.jpg b/static/images/blog/web-search-homepage/strus.jpg
deleted file mode 100644
index 70c0776..0000000
Binary files a/static/images/blog/web-search-homepage/strus.jpg and /dev/null differ
-- 
cgit v1.2.3-54-g00ecf