arabiCorpus

arabic corpus search tool

Instructions
click on instructions link in search bar to access these instructions at any time

click on red text to expand and collapse information
General Information about arabiCorpus
  • Searches large, untagged Arabic corpora for words you type in.
  • Searches for EXACTLY the string you type in (and nothing else).
  • Corpora can be searched singly or in various combinations.
  • Some filtering of results available.
  • Easy to search for individual words.
  • Ability to search for multiple words at once.
  • Accepts most of regular expression language.
  • Use the Tutorial below to help you with your initial searches.
Searching arabiCorpus
Results
  • Basic Information about Results
    • The results come back after a few seconds' wait.
    • If you search for a single noun or adjective in a single corpus, it should come back in about 10 seconds
    • If you search for multiple nouns, or verbs, or search in the combined corpora, the results will take longer
    • If you search for a very common word that produces tens of thousands of hits or more, the program will sometimes balk because the database it is using will limit what can be inserted.
    • The results of a search are saved temporarily in a database, and you can access them by clicking on the words that appear in the red bar.
  • Summary Page
    • This page comes up first automatically after a search, and you can return to it by clicking on 'summary'
    • This page gives you summary information about your search:
      • the search string you typed in, in both scripts
      • the search strings actually used by the program after its 'figuring'
      • the corpus you searched
      • the time it took the search engine to do the basic search (this will generally be less than the actual time experienced by you, since it doesn't include the time it takes for the server to receive your request or to serve the results back to you).
      • the part of speech filter you chose
      • the part of speech filter that was actually used
      • the number of 'hits'
      • what that number translates to in terms of words per 100,000 words of your corpus
    • The latter bit of information is useful for comparative purposes. The corpora are of vastly different sizes, so the actual number per corpus may be misleading, but the number per 100,000 words can be compared.
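The per-100,000 figure is simple arithmetic. Here is a minimal sketch in Python (not the site's own code). The function name and the hit counts are made up for illustration; the two corpus sizes are the ones given in the Corpus Information list in these instructions.

```python
def hits_per_100k(hits: int, corpus_words: int) -> float:
    """Return the number of hits per 100,000 words of corpus."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits * 100_000 / corpus_words


# Corpus sizes from the Corpus Information list.
AHRAM99_WORDS = 16_475_979
TREEBANK_WORDS = 598_590

# Hypothetical hit counts, chosen only to show the comparison.
ahram_rate = hits_per_100k(4_000, AHRAM99_WORDS)
treebank_rate = hits_per_100k(300, TREEBANK_WORDS)

print(f"Ahram99:  {ahram_rate:.2f} per 100,000 words")
print(f"Treebank: {treebank_rate:.2f} per 100,000 words")
```

With these invented counts the raw number is far higher in Ahram99, but the rate per 100,000 words is higher in Treebank, which is why the normalized figure is the one to compare.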
  • Citations Page
    • Click on 'citations' in the red bar to see the 'hits' with the 10 words before and the 10 words after.
    • By default, these citations are sorted by the word that appears directly before the word you searched for
      • Note that this sorting is done by the whole word, not by the root, so AlktAb, ktAb, and wAlktAb are nowhere near each other.
      • The word directly before is repeated at the left hand side of the page so you can quickly glance through the citations and notice patterns, collocations, etc.
    • If you would rather see the citations sorted by the word directly after the 'hit', click on the red sentence that says 'sort by word after'
    • The citations are shown 100 at a time.
      • If your search returned more than 100 citations, they are organized into 'pages' which you can access by clicking on the red numbers at the top.
    • If you searched a single corpus, the subsection column will indicate what subsection of the corpus the example comes from.
      • The newspapers have self-labeled subsections, so sometimes you need to think to figure out what the codes mean, but they are related to the various sections of the newspaper.
    • If you have searched a combined corpus, the subsection column will tell you which specific corpus the example came from.
    • If you want to see the exact reference of your example, click on the red 'subsection' heading and it will change to reference.
      • The references in many of the corpora are basically incomprehensible, and not very helpful, but the references to some, like the Quran and the novel, are useful.
      • The references in the Quran are to Sura and verse.
      • The references in the novel are to chapter, section of chapter (divided by stars), and paragraph number.
      • The references in 1001 Nights are to Nights and paragraphs (but in an odd kind of way I'll explain if you ask me by e-mail)
    • If you want to see more context than the 10 words before and the 10 words after provides, click on the number at the beginning of the citation.
      • This will bring up a separate window that will display the whole verse from the Quran, the whole paragraph from the novel and 1001 Nights, and the whole article from the newspapers.
      • You may have to use your browser's Arabic script search function to find the place in the larger context where your item appears, since it does not highlight it.
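The citation display described above (10 words of context on each side, sorting by the word before or after, 100 citations per page) can be sketched roughly as follows. This is an illustration in Python, not the site's implementation; all function names and the sample text are invented.

```python
from typing import List, Tuple

# (word_before, left_context, hit, right_context, word_after)
Citation = Tuple[str, str, str, str, str]


def concordance(tokens: List[str], target: str, width: int = 10) -> List[Citation]:
    """Return every occurrence of target with up to `width` words on each side."""
    rows = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        left = tokens[max(0, i - width):i]
        right = tokens[i + 1:i + 1 + width]
        rows.append((
            left[-1] if left else "",
            " ".join(left),
            token,
            " ".join(right),
            right[0] if right else "",
        ))
    return rows


def sort_by_word_before(rows: List[Citation]) -> List[Citation]:
    # Plain string order on the whole word, so AlktAb, ktAb and wAlktAb
    # end up far apart, as noted above.
    return sorted(rows, key=lambda row: row[0])


def sort_by_word_after(rows: List[Citation]) -> List[Citation]:
    return sorted(rows, key=lambda row: row[4])


def page(rows: list, number: int, size: int = 100) -> list:
    """Return page `number` (counting from 1) holding up to `size` citations."""
    start = (number - 1) * size
    return rows[start:start + size]


# Made-up sample text in the site's transliteration.
tokens = "fy mktb Alwzyr w mn mktb Almdyr w fy mktb AlbryD".split()
rows = concordance(tokens, "mktb")
for before, left, hit, right, after in sort_by_word_before(rows):
    print(f"{before:>6} | {left} [{hit}] {right}")
```

Repeating the word before in its own column, as the real page does, is what makes patterns visible when you scan down the list.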
  • Subsections Page
    • Click on 'subsections' to see the totals for the various subsections of the corpus you searched
      • Again, if you searched a single corpus, those subsections will be those defined for that corpus.
      • Some corpora, particularly the non-news ones, have no subsections defined.
      • If you searched a combined corpus, the subsections will be the names of the corpora in that combined corpus
    • The results in this section are ordered from most frequent to least frequent in terms of number.
    • The number per 100,000 words for each subsection is also given.
    • This page allows you to compare, say, premodern Arabic with modern, or to compare newspapers from various places
      • Type lA bd and choose newspapers, for example, click on subsections, and you will find that the Ahram uses it much less frequently than the Hayat.
      • Type lAbd and choose newspapers, though, and you will find the opposite (the Ahram apparently typically types the phrase without a space, and the Hayat does the opposite).
  • Word Forms Page
    • Click on 'word forms' to see the exact forms that your search produced, ordered by frequency.
    • Examining this list can give you important hints about normal usage.
    • Examining this list can also help you see easily what kinds of false hits you are getting so you can work to eliminate them
    • It is strongly suggested that you examine this list for every search before you take the rest of the results seriously.
      • See if there are any forms you expected to find that aren't there
      • See if there are any forms you did not expect to find that are there
      • See if you can figure out what it is about the expression you typed in (and about Arabic morphology) that would have created these problems
    • Click on any word in the word form list and a window will open up with the citations for just that word form.
  • Words Before/After Page
    • Click on 'words before/after' to see a list of the common words directly before and directly after the 'hit' ordered by frequency.
    • Examining these lists is a good way to scope out the main usages and collocations of the word you searched for
    • Examining the lists can also help you identify structures that you had not intended to search for that you want to cut out
    • If you want to see only the citations with that particular word before or after, click on a word in the list
      • A separate window will open showing you just those citations
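Counting the words directly before and after each hit takes only a few lines. The following Python sketch shows the idea; the function name and the sample text are invented, and it makes no claim about how the site itself does the counting.

```python
from collections import Counter
from typing import List, Tuple


def neighbours(tokens: List[str], target: str) -> Tuple[Counter, Counter]:
    """Count the words directly before and directly after each hit."""
    before: Counter = Counter()
    after: Counter = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        if i > 0:
            before[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            after[tokens[i + 1]] += 1
    return before, after


# Made-up sample text in the site's transliteration.
tokens = "fy mktb Alwzyr w mn mktb Almdyr w fy mktb AlbryD".split()
before, after = neighbours(tokens, "mktb")

# most_common() orders by frequency, like the lists on this page.
print(before.most_common())
print(after.most_common())
```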
Miscellaneous
  • Recommended Browser
    • The Firefox browser is recommended, although you should be able to use the site in other browsers as well.
  • Note About Numbers
    • Most of the data for the current corpora were created before Unicode was in widespread use. The numbers (the digits), particularly, were entered in a variety of formats (even in the same article) which means that some show up frontwards and some backwards. I found myself unable to fix this, so you will simply have to live with the fact that the numbers might actually be reversed in any particular case.
  • Corpus Information
    • The corpora in the list currently include one year of Al-Ahram (1999), two years of Al-Hayat in separate corpora (1996, 1997), and a half year each of At-Tajdid (Moroccan) and Al-Watan (Kuwait), the Quran, 1001 Nights, several medieval medical and philosophical texts, 8 Egyptian novels, one Egyptian Arabic play, and some EgyptChat data from the internet, as well as the Penn Treebank news data. Most of the corpora are only available in combined groups, but the individual newspapers can be searched separately. Choosing to search ALL or NEWSPAPERS (both large combinations of corpora) will often lead to MUCH longer search times. If you combine that with a search for something that is going to produce hundreds of thousands of hits, the machine may balk and not return anything. In general, the ALL category is mainly appropriate for researchers looking for broad quantitative or comparative data. The smaller categories are usually going to be more appropriate for pedagogical data, or data for students, since it will come in a quantity that you can understand and deal with.
    • The following are the word totals for the various corpora in the site. Note that you cannot add these for the total number, since the grouped corpora group and regroup in various configurations. For example, the Modern Literature group includes everything in Novels, and adds the play, the Egyptian Colloquial group includes the play and the Chat, etc. (By the way, you should be aware that EgyptChat includes some colloquial and a lot of fusha and mixed fusha and colloquial, while the literature and even the newspapers do include some colloquial.) The total number of words of the whole corpus is: 68943447.
      • Ahram99: 16475979
      • Hayat97: 19473315
      • Hayat96: 21564239
      • Tajdid02: 2919782
      • Watan02: 6454411
      • Treebank: 598590
      • Novels: 387036
      • Quran: 84532
      • 1001Nights: 557908
      • Premodern: 912996
      • Medieval Science: 223249
      • Egyptian Colloquial: 157099
      • Modern Literature: 403901
  • Warning
    • Among the things you can do to make use of this tool annoying:
      • search for very common words or strings in multiple corpora (you have to wait basically forever). When learning to use the tool, try a small corpus (say Treebank) and search for slightly less common words.
  • Common errors to avoid:
    1. typing the noun with an article (the tool automatically looks for nouns and adjectives with and without the article; if you type it with the article it won't find any examples without it)
    2. choosing the wrong part of speech filter (looking for an adjective as a noun, or looking for a verb4 as verb);
    3. (by hand page:) not typing the input correctly for verb2, verb4 and verbd (note that these take special input, on which see the instructions above; typing qAl and choosing verb4 will NOT give you the results you want);
    4. choosing noun when looking for a phrase (you should usually choose adv or string).
    5. (by hand page:) typing a word with one or more vowels, without clicking on 'include vowels'. Best is not to type vowels in the search word. See 'searching with vowels' in the advanced search section.
  • Announcements
    • 18 October 2007: A list of the novels currently in the Modern Literature corpus, as well as an updated word total list of the various corpora, has been added. Click on Searching arabiCorpus>Detailed Searching Instructions>Choose Corpus.
    • 25 July 2007: More novels and some non-fiction have also been added. I am behind in listing the information about the new items but it will eventually come. I have also added about 16 million words of Al-Thawra, a Syrian newspaper. The subsections of this newspaper are not as helpful as the other ones in the corpus (since I rely on those already on the site). But it is good to have a pure Syrian source of data for comparative purposes.
    • 15 June 2007: Several new novels have been added, and the organization of the corpus menu has been made easier to navigate. Check the Corpus Information link under Miscellaneous to get an updated list of the items included and their word counts.
    • 30 June 2006: Several new features have been added, and the general search analysis has been upgraded:
      • The various verb choices under part of speech have been combined into one.
      • The program assumes you will type the masculine singular (huwa) form for both the past and present like this: ktb,yktb. If you type something different it will make guesses as to what you meant, but could guess wrong. In the case of hollow and doubled verbs, you can also choose to type all four stems as before.
      • Crucially, for doubled verbs, you need to type the shadda at the end. This is an exception to the rule that you don't type vowels into the search string. The shadda is stripped before searching, but the program uses it to become aware that what you typed is a doubled verb.
      • Vowels typed in are now automatically stripped out.
      • You can still search for multiple nouns and adjectives (say a noun and three forms of its plural) at the same time, but the program no longer allows you to search for more than one verb at once.
      • The program now automatically searches for duals as well as singulars on all nouns and adjectives (and verbs).
      • The program now does a somewhat better job of matching prefixes and suffixes on verbs, so that impossible combinations are weeded out better than before.
      • Choices like searching with vowels and not collapsing initial hamzas are no longer available on basic search. Click the advanced search link to find them.
      • A new Advanced Search option has been added. Clicking on this takes you to a new page with three options (Note that the instructions available from that page give more details about these options):
        • Lookup: On this page you can type in a noun, and it will look up its plural(s) and include them in the search. If you look up an adjective it will do the same, but also include the feminine singular (for example if you type in LHmr, it will automatically also look for HmraC). If you choose Hollow, Doubled, Assimilated or one of the other verb forms listed, it will look up all associated forms and create the search string automatically. Note that it does not allow you to look up 'regular' verbs like ktb that don't need to be looked up, since the perfect and imperfect stems are exactly the same. To do that, you need to go to basic search or 'by hand' search.
        • Hand: This page is like the original basic search, but without trying to second guess you. It allows you to craft your search however you like, and makes you live with the consequences of inappropriate choices. This is the only place where you can now search with vowels or without collapsing initial hamzas.
        • Advanced Squared: This page is under construction and currently doesn't do anything. The plan is to allow you to completely specify the grammatical components of the form you are searching for, and specify which forms you want to find more completely. When finished, this will allow users who are not familiar with regular expressions to tweak their searches more effectively.
    • 15 May 2006: I have added a checkbox right under the submit button. Previously, you had to type in specifically whether you wanted to search for an alif at the beginning of a word alone, or with a hamza above, or with a hamza below. Now, no matter which of these (or even an alif with a madda) you type, it will search for all of them, unless you click the checkbox 'differentiate initial hamzas' in which case it will search only for what you type in. This was done because I discovered that most people really do want to see all examples of what they are looking for, but they forget to type in the choice (i.e. they want to see all examples of LwlAd 'boys' but they type either LwlAd or AwlAd, and either would only give them part of the examples). To see all the examples they would have needed to type [AL]wlAd. Now when you type AwlAd or LwlAd it automatically searches for [AL]wlAd unless you click the check box.
    • 9 May 2006: The tutorial is done (in its initial version). If you have time, check it out and send me feedback.
    • 9 May 2006: Two short books by Maimonides on the practice of medicine (Asthma and Aphorisms) have been added to the Premodern corpora.
    • 5 May 2006: I have started to fix the tutorial to be a real, beginning level tutorial. If you are having trouble with the site, it would be a good place to start.
    • 24 April 2006: You may now click on a word in the word forms list and see the citations for just that word form (just as clicking on a before/after word in the before after list gives you just the citations with that word before or after).
    • 19 April 2006: I have added a tutorial section to these Instructions. I will give there help on specific searches that people have run, or that they have asked me about. It will be updated regularly, hopefully.
    • 19 April 2006: The program has been fixed to allow the use of grouping parentheses and vertical bars to signal alternation in regular expressions. This means, however, that you can no longer use the vertical bar for other purposes. To indicate you want the program to search for two words, use a comma instead.
    • 6 April 2006: A space bar now means a space bar, so you can no longer type in two words separated by a space bar and have the program search for either of those words. To do that you must use the comma. A space bar between two words now tells the program to search for that exact phrase in that order.
    • 5 April 2006: The part of speech femnoun has been omitted. Its function now takes place automatically when you choose 'noun'.
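The automatic rewriting described in the 15 May 2006 announcement (AwlAd or LwlAd becomes [AL]wlAd unless the checkbox is ticked) might look something like the Python sketch below. The function and parameter names are invented. Only the two symbols A and L are handled, because those are the only ones spelled out in the announcement; the symbols for hamza below and madda are left out rather than guessed.

```python
import re

# Initial alif forms named in the announcement: A (bare alif), L (hamza above).
# Assumption: the real program also covers hamza below and madda.
INITIAL_ALIF_FORMS = "AL"


def collapse_initial_hamza(term: str, differentiate: bool = False) -> str:
    """Rewrite an initial A or L as the character class [AL].

    `differentiate` stands in for the 'differentiate initial hamzas' checkbox.
    """
    if differentiate or not term or term[0] not in INITIAL_ALIF_FORMS:
        return term
    return f"[{INITIAL_ALIF_FORMS}]" + term[1:]


pattern = collapse_initial_hamza("AwlAd")
print(pattern)
print(bool(re.fullmatch(pattern, "LwlAd")))
print(collapse_initial_hamza("AwlAd", differentiate=True))
```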
Tutorial
    • First Search: Look for a single noun
      • Click on the red Instructions above the submit button to get these instructions in a separate window (otherwise they will go away as you follow them).
      • Click on 'dt chart' to see a chart of the transliteration system in a separate window.
      • Type mktb into the leftmost box. Do NOT type any vowels (i.e. do NOT type maktab). Do Not type the definite article (do NOT type Almktb).
      • Choose 'noun' from the POS list.
      • Choose Ahram99 from the corpus list.
      • Click on 'Submit'.
      • Wait about 10 seconds.
      • Examine summary of search that appears, and notice how many examples it found, and how frequent these are in the corpus (per 100,000 words).
      • Click on 'citations' in the dark red bar.
      • Scroll down and look at a few of the citations. Scroll back to the top.
      • Note that there are about 40 pages of results. Click on page 25.
      • Note that each example gives you the word in context with 10 words before and 10 after.
      • Note that the examples are organized by the word that comes before (here fy), and that this word is also listed at the beginning of each line so it can be easily picked out.
      • Click on 'sort by word after' to sort the examples by the word after instead.
      • Click on page 25 again, and see what kind of information you get with this order.
      • Click on one of the numbers at the left, and see a new window open with even more context. Close that window
      • Click on 'subsections' in the dark red bar.
      • Examine the frequencies and relative frequencies of this word in the various sections of the Ahram.
      • Click on 'word forms' in the dark red bar.
      • Notice the different forms in which this word was found: alone, with wa-, bi-, fa-, with the definite article, and with various pronoun endings. Notice which of these forms were more common and which relatively rare.
      • Click on مكتبك in the middle of the second column to see the citations just for that word form in a separate window. Examine briefly and close the second window.
      • Click on 'words before/after' in the dark red bar.
      • Examine the most common words that come before our search word and the most common words that come after. Any surprises? Is this what you would have predicted?
      • Click on التنسيق in the second column to see the 200 examples of this word coming after our search word in a new window. After examining for a few moments, close the new window.
      • Click on 'summary' in the dark red bar to go back to the summary page.
    • Try other single words
      • Type jmyl into the leftmost box, choose the 'adj' POS, the Ahram99 corpus, and click submit.
      • Go through the various choices (as under First Search) and see the differences. Note particularly under 'word forms' that you get examples with and without the feminine ending, but no examples with pronoun endings or prepositions.
      • Now try running the same word on the same corpus but with 'noun' POS chosen. Note under 'word forms' that you don't get the feminine forms, but you do get forms with prepositions and pronouns.
      • Type tqrybA into the leftmost box, choose 'adv' POS and the Ahram99 corpus and run through the various options once you get the results.
      • Type ktAbhm into the leftmost box, choose 'adv' POS and the Ahram99 corpus, and look to see what you got under word forms. 'adv' is a good POS choice when you are looking for a very specific form, and don't want to see other possible prefixes and suffixes.
      • Type prb into the leftmost box, choose the 'verb' POS, the Ahram99 corpus, and run through the various options looking at the results. Note particularly the large number of forms under 'word forms'.
      • Now run prb again, but this time choose the 'string' POS. Note under word forms that besides all the verb forms you got a moment ago, you are now getting a bunch of other things that happen to contain the sequence prb, including a bunch of typos.
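One way to picture the difference between the 'noun', 'adv' and 'string' choices is as three different ways of wrapping the same typed string. The Python sketch below is only an illustration: the prefix and suffix lists are small subsets picked for the example, not the program's real tables, and the function name is invented.

```python
import re

# Illustrative subsets only; the real program knows many more affixes.
NOUN_PREFIXES = r"(?:w|f)?(?:b|l|k)?(?:Al)?"
NOUN_SUFFIXES = r"(?:h|hA|hm|k|km|y|nA)?"


def matches(term: str, pos: str, form: str) -> bool:
    """Say whether a word form would count as a hit under this sketch."""
    if pos == "string":
        # Anything that contains the typed sequence.
        return re.search(term, form) is not None
    if pos == "adv":
        # Exactly what was typed, no prefixes or suffixes.
        return re.fullmatch(term, form) is not None
    if pos == "noun":
        # The typed stem with optional prefixes and pronoun endings.
        pattern = NOUN_PREFIXES + f"(?:{term})" + NOUN_SUFFIXES
        return re.fullmatch(pattern, form) is not None
    raise ValueError(f"unsupported POS in this sketch: {pos}")


forms = ["ktAb", "AlktAb", "wAlktAb", "ktAbhm", "ktAbQ"]
for pos in ("noun", "adv", "string"):
    print(pos, [f for f in forms if matches("ktAb", pos, f)])
```

In this sketch 'string' picks up ktAbQ, a different word that merely contains the sequence, while 'adv' returns only the bare form typed.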
    • Account for variability in the corpus
      • Type AwlAd into the leftmost box, choose 'noun' POS, and the Ahram99 corpus.
      • Look through the various pages of results. Note on the word form page that you got no hits with a hamza on the beginning alif.
      • Now type LwlAd into the leftmost box, choose 'noun' POS, and the Ahram99 corpus.
      • Look through these results. Note on the word form page that you got no hits without a hamza on the beginning alif.
      • These results are mutually exclusive. But what if you want to see them all together, or what if you wanted to know how many times this word was used in the corpus, whether or not the hamza was written on the beginning alif?
      • There are several ways to have the corpus look for alternates (either this OR that). The easiest involves square brackets.
      • Type [AL]wlAd into the leftmost box, choose 'noun' POS, and the Ahram99 corpus.
      • The square brackets tell the program to search for any one of the characters in the brackets. In this case it searches for either A or L, meaning that it is searching for either AwlAd or LwlAd.
      • Look through these results. You can see from the total numbers, that the examples of AwlAd and LwlAd have been combined. You can see the same thing on the word forms page.
      • In general, if you want ALL the examples of any form that begins with a hamza on an alif, you should use square brackets to have the program search for both possibilities.
      • Now type lbnAne into the leftmost box, choose 'noun' POS, and the Ahram99 corpus.
      • Notice that you get 20 hits that end with an alif maqsura. If you had searched for lbnAny and wanted to see all examples of use of this word, the program would have missed these 20.
      • In general, if you want ALL examples of forms that end in y, you would need to use [ey] instead.
      • The Ahram itself is a special case because for whatever reason they use the yaa' for the alif maqsuura much of the time. So if you search for mqhe (noun) in the Ahram99 corpus you get zero results, but if you search for mqhy you get 158, and examining their citations reveals that they are meant to be mqhe.
      • Therefore, if the Ahram is included in your search, you should use [ye] for words that end in either y or e.
      • Now type msWwl into the leftmost box, choose 'adj' POS and the Newspaper combined corpus. You will have to wait longer for the results because of the larger corpus.
      • If you check out the subsections page, you will see that there are a huge number of hits in the other papers, but very few in the Ahram.
      • Now type msYwl into the leftmost box, choose 'adj' POS, and the Newspaper combined corpus.
      • Notice in the subsections page that almost all the examples are from the Ahram, and very few from the others.
      • If you wanted to catch ALL instances of this word in the Newspaper combined corpus, you would need to type ms[WY]wl.
      • Now type tlfwn into the leftmost box, choose 'noun' POS, and the Ahram99 corpus. Look how few the results are.
      • Now type tlyfwn into the leftmost box, choose 'noun' POS, and the Ahram99 corpus. Look how many the results are.
      • If you want to find tlfwn and tlyfwn together you could type tly?fwn. The question mark tells the program the yaa' can either be there or not.
      • The program is really just an automaton. It doesn't know what you want. It just looks for exactly what you type in. If you need it to pay attention to spelling variation, then you must tell it to do so.
      • Moral: PAY ATTENTION to spelling variations in Arabic, and account for them in your searches, often by using square brackets or question marks.
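The square brackets and question mark used in this section behave the same way in most regular expression engines, so you can try them out away from the site. Here is a short Python sketch using the standard `re` module; the helper name is invented, and the list of forms is just the spellings mentioned above, matched as bare words with no prefixes or suffixes.

```python
import re
from typing import List


def matching_forms(pattern: str, forms: List[str]) -> List[str]:
    """Return the forms that the pattern matches as whole words."""
    compiled = re.compile(pattern)
    return [form for form in forms if compiled.fullmatch(form)]


# Spellings discussed in this section.
forms = ["AwlAd", "LwlAd", "lbnAny", "lbnAne",
         "msWwl", "msYwl", "tlfwn", "tlyfwn"]

print(matching_forms("AwlAd", forms))       # only the spelling typed
print(matching_forms("[AL]wlAd", forms))    # either initial alif
print(matching_forms("lbnAn[ey]", forms))   # yaa' or alif maqsura
print(matching_forms("ms[WY]wl", forms))    # either hamza seat
print(matching_forms("tly?fwn", forms))     # yaa' optional
```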
    • Search for more than one word at once
      • Type mktb,mkAtb into the leftmost box. Note that there is no space after the comma. Choose 'noun' POS and the Ahram99 corpus.
      • Notice on the word forms page that the two words have been combined into a single search. You can string any number together with commas.
      • Now try the following combinations:
        • hjwm,hjmAt,hjwmAt,hjmQ,mhAjmQ (check out the word forms page to get an idea of relative frequency)
        • bTbycQ AlHAl,TbcA,bAlTbc (ditto)
        • hAtf,tlfwn,tlyfwn,mwbAyl,mHmwl,xlwy
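A comma-separated search can be thought of as one combined pattern in which each comma is an "or" and a space inside a term is a literal space. The Python sketch below illustrates that reading; the function names and the sample sentence are invented, and the whole-word boundaries are a simplification that ignores the prefixes and suffixes the program adds for each part of speech.

```python
import re
from typing import List


def split_terms(query: str) -> List[str]:
    """Split a query on commas; spaces are kept because they are meaningful."""
    return [term for term in query.split(",") if term]


def combined_pattern(query: str) -> "re.Pattern[str]":
    """Build one pattern that matches any of the comma-separated terms."""
    alternatives = "|".join(split_terms(query))
    # (?<!\S) and (?!\S) require whitespace or the text edge on both sides.
    return re.compile(rf"(?<!\S)(?:{alternatives})(?!\S)")


print(split_terms("mktb,mkAtb"))

# Made-up sample text in the site's transliteration.
text = "TbcA w bTbycQ AlHAl w bAlTbc"
print(combined_pattern("bTbycQ AlHAl,TbcA,bAlTbc").findall(text))
```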
    • Filter results 'by hand'
      • Type lbnAn into the leftmost box, choose 'noun' POS, and Ahram99 corpus.
      • Examine the word forms. Notice that you have quite a few examples of lbnAny, which the program got by putting the y suffix for 'my' on the end of the noun.
      • However, if you click on this word on the word forms page and examine these citations, you will see that none of them mean 'my Lebanon'. The y here is a nisba adjective suffix.
      • In other words, you have gotten a bunch of things you don't want because of the morphological ambiguity of Arabic.
      • Let's say you want to look through all the citations with lbnAn, and you don't want to have to deal with all those lbnAny's.
      • Type lbnAn -- y$ into the leftmost box, choose 'noun' POS and Ahram99 corpus.
      • The two dashes tell the program that whatever you type after them is NOT wanted (filter it out).
      • The dollar sign means that this is at the end of the word.
      • So this means, search for all examples of lbnAn with normal noun suffixes, but if you find one that ends in a y, filter it out.
      • A look at the word forms list will convince you that it has now gotten rid of the unwanted lbnAny's.
      • Try the following searches with and without the dash section and check the word forms page to see what it cuts out:
        • mktb -- ^m (this will cut out all word forms that begin with a miim, i.e. all that don't have some other prefix)
        • mktb -- h (this one will cut out all word forms that have a haa' in them, meaning that mktb with certain pronoun endings will be cut out)
        • mktb -- h|y|n (this one will cut out all word forms with a haa', yaa', or nuun, meaning that mktb with all but second person pronoun endings will be filtered out. Note that the vertical bar represents alternation.)
        • mktb -- h|y|n|km?A?$ (this one also cuts out the second person forms, but you have to be careful: if you just list kaaf by itself it will cut out everything, since kaaf is in the word itself. So you need the question marks and the dollar sign.)
      • A good strategy is to perform plain searches first, and then examine the word form list to see if there are any forms you would like to filter out by hand. If there are, create a 'dash' phrase to do the job.
      • Note for those with a 'regular expression' background: You can consider the list after the dashes to be a regular expression for which the program automatically provides an outside set of grouping parentheses.
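The 'dash' filter can be tried out on a list of word forms with a few lines of Python. This is a sketch of the idea only: the function names are invented, the word forms are a made-up sample, and the filter is wrapped in one outer set of parentheses as the note above describes.

```python
import re
from typing import List, Optional, Tuple


def parse_query(query: str) -> Tuple[str, Optional["re.Pattern[str]"]]:
    """Split 'search -- unwanted' into the search string and an exclusion pattern."""
    search, dashes, unwanted = query.partition("--")
    unwanted = unwanted.strip()
    exclude = re.compile(f"({unwanted})") if dashes and unwanted else None
    return search.strip(), exclude


def keep_forms(forms: List[str], exclude: Optional["re.Pattern[str]"]) -> List[str]:
    """Drop every form in which the exclusion pattern is found anywhere."""
    if exclude is None:
        return list(forms)
    return [form for form in forms if not exclude.search(form)]


# Made-up sample of word forms.
lbnAn_forms = ["lbnAn", "wlbnAn", "lbnAny", "blbnAn"]
mktb_forms = ["mktb", "Almktb", "wmktb", "mktbh", "mktby",
              "mktbnA", "mktbk", "mktbkm"]

for query, forms in [
    ("lbnAn -- y$", lbnAn_forms),
    ("mktb -- ^m", mktb_forms),
    ("mktb -- h|y|n", mktb_forms),
    ("mktb -- h|y|n|km?A?$", mktb_forms),
]:
    search, exclude = parse_query(query)
    print(query, "->", keep_forms(forms, exclude))
```

Replacing the last filter with a bare k empties the list, since every form of mktb contains a kaaf.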
    • Search for a Form IV or VIII verb
      • Type [LA]krm into the leftmost box, choose 'verb' POS and Ahram99.
      • Remember that the square brackets mean either this or that (i.e. either the alif with or without the hamza).
      • Check total and the word forms. Notice that they are all basically past tense. Doesn't this verb have present tense forms?
      • Type [LA]krm,krm (just like that with no space after the comma), choose 'verb2' POS and Ahram99.
      • Notice you got a lot more hits from the total, and under word forms, notice that you get all the imperfect forms with the perfects now.
      • Now type Antxb,ntxb and then choose verb2 POS and Ahram99. Again check word forms to see that you got both tenses.
      • Now type wSl,Sl and then choose verb2 POS and Ahram99. Check word forms.
      • Choosing plain verb, the program tries all the verbal prefixes and suffixes on the single stem you type in.
      • This works for verbs that use the same graphical stem in the perfect and imperfect, but not for the others.
      • The POS 'verb2' requires you to input 2 different stems, and it will try all the perfect suffixes on the first one, and all the imperfect prefixes and suffixes on the second.
      • When searching for a verb, stop a moment and think: If I add the imperfect prefixes on exactly what I just typed in, is it going to work?
      • If not, you probably should use 'verb2' or one of the other choices so you can tell the program exactly what the imperfect stem is.
      • On the other hand, if you choose 'verb2' but just type in one stem (like [LA]krm), it will still only find the perfects.
      • The ONLY way to find all the forms of Form IV, VII, VIII, IX, X and Form I assimilated verbs is to choose 'verb2' and type in TWO stems, separated by a comma.
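The difference between 'verb' and 'verb2' comes down to which stem the imperfect prefixes are attached to. The Python sketch below shows this with a deliberately tiny set of affixes; the lists are illustrative subsets chosen for the example, not the program's real tables, and the function names are invented.

```python
from typing import List

# Illustrative subsets only.
PERFECT_SUFFIXES = ["", "t", "wA"]
# (prefix, suffix) pairs for a few imperfect forms.
IMPERFECT_AFFIXES = [("y", ""), ("t", ""), ("n", ""), ("y", "wn"), ("t", "wn")]


def verb2_forms(perfect_stem: str, imperfect_stem: str) -> List[str]:
    """Perfect suffixes go on the first stem, imperfect affixes on the second."""
    perfect = [perfect_stem + suffix for suffix in PERFECT_SUFFIXES]
    imperfect = [prefix + imperfect_stem + suffix
                 for prefix, suffix in IMPERFECT_AFFIXES]
    return perfect + imperfect


def verb_forms(stem: str) -> List[str]:
    """Plain 'verb': the same stem is used for perfect and imperfect."""
    return verb2_forms(stem, stem)


print(verb_forms("ktb"))              # one stem works for a sound Form I verb
print(verb_forms("Antxb"))            # builds yAntxb, which is not a real form
print(verb2_forms("Antxb", "ntxb"))   # builds yntxb as well as AntxbwA
```

This is why a plain 'verb' search on a Form VIII stem returns essentially only past tense hits: the imperfect candidates it builds do not occur in the corpus.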
    • Search for a doubled or hollow verb
      • Type zAr into the leftmost box, choose 'verb' POS, and Ahram99.
      • Check out the word forms and notice that they are all either the 'he', 'she', and 'they' forms of the past tense, or present tense passives.
      • (You will also notice ambiguous forms like wzArth, which COULD mean 'and she visited him' but which probably mean 'his ministry', but that is a different issue. See instructions on filtering results by hand if desired.)
      • Notice that there are no past or present tense forms with the short stem, like zrt, and no present tense forms with the long uu stem like yzwr.
      • Type zAr,zwr into the leftmost box, choose 'verb2' POS and Ahram99.
      • Check out the word forms. This time you get all the long forms like zArwA and yzwr, but none of the short forms.
      • Now type zAr,zr,zwr,zr into the leftmost box, choose 'verb4' POS and Ahram99.
      • Notice in word forms that you now get the short forms as well, like zrt.
      • When you choose 'verb4', the program checks for the perfect long form suffixes (he, she, they) on the first form you type in, for the perfect short form suffixes on the second form you type in (you, I, they-f), for the imperfect long form prefixes and suffixes on the third form you type in (all but they-f and you-pl-f), and for the imperfect short form prefixes and suffixes on the fourth form you type in (they-f and you-pl-f).
      • If you only type in one form when you choose 'verb4' the program will only check for the perfect he, she, they suffixes and nothing else.
      • Type mr,mrr,mr,mrr into the leftmost box, choose 'verb4' POS and Ahram99.
      • Notice under word forms that you get both the cases where one 'r' is written, and cases where two 'r's are written.
      • In general, whenever you want to find all the forms of a verb, you need to think carefully about the various stems that the endings go on.
      • Almost all verbs (except for defective roots) fit into one of three categories:
        • a single (graphic) stem that all perfect and imperfect prefixes and suffixes are added to (choose 'verb')
        • two stems, one used for all perfect suffixes, and the other for all imperfect prefixes and suffixes (choose 'verb2')
        • four stems, one for perfect suffixes that begin with a vowel, the second for perfect suffixes that begin with a consonant, the third for imperfect suffixes that begin with a vowel, and the fourth for imperfect suffixes that begin with a consonant (choose 'verb4')
      • The following categories of verbs fit these types:
        • single stem ('verb'): Sound Form I, II, III, V, VI verbs
        • two stems ('verb2'): Assimilated Form I, Hamzated Form II, Sound Form IV, VII, VIII, IX, X verbs
        • four stems ('verb4'): Hollow verbs, Doubled verbs of all forms
      • The program does not have any kind of dictionary lookup and it doesn't know what form verb you have typed in.
      • All it does is look for the exact sequence of characters you type in, and check it with the prefixes and suffixes it knows for all verbs.
      • Summary: the program will not do a good job of finding all forms of a particular verb unless YOU take responsibility for figuring out the morphological category of the verb you are searching for, and search for it in a way appropriate to that category.
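The four-stem logic described above can be sketched in Python. This is only an illustration under stated assumptions: the affix lists are simplified guesses, not the tables arabiCorpus actually uses, and `verb4_pattern` is a hypothetical helper name.

```python
import re

# Simplified, illustrative affix lists (assumptions, NOT the program's real tables).
PERF_LONG = ["", "t", "A", "tA", "wA"]             # he, she, they: vowel-initial endings
PERF_SHORT = ["t", "nA", "tm", "tmA", "tn", "n"]   # I, you, we, they-f: consonant-initial endings
IMPF_PREFIXES = ["y", "t", "n", "L", "A"]
IMPF_LONG = ["", "wn", "wA", "yn", "y", "An", "A"]
IMPF_SHORT = ["n"]                                 # they-f, you-pl-f


def alt(items):
    """Build a non-capturing alternation from literal affixes."""
    return "(?:" + "|".join(re.escape(i) for i in items) + ")"


def verb4_pattern(spec):
    """Turn 'stem1,stem2,stem3,stem4' into one compiled regex (hypothetical helper)."""
    s1, s2, s3, s4 = spec.split(",")
    branches = [
        s1 + alt(PERF_LONG),                        # perfect, long stem
        s2 + alt(PERF_SHORT),                       # perfect, short stem
        alt(IMPF_PREFIXES) + s3 + alt(IMPF_LONG),   # imperfect, long stem
        "[yt]" + s4 + alt(IMPF_SHORT),              # imperfect, short stem
    ]
    return re.compile(r"\b(?:" + "|".join(branches) + r")\b")


pat = verb4_pattern("zAr,zr,zwr,zr")
words = "zAr zArt zArwA zrt zrnA yzwr tzwrwn yzrn wzArth zAyr".split()
matched = [w for w in words if pat.fullmatch(w)]
print(matched)
```

Stems are passed through unescaped so that a stem may itself contain a character class such as q[Ay]l. Note that wzArth is not matched here, because this sketch allows no prefixes such as w- before the stem.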
    • Search for a defective verb
      • Type dc,au into the leftmost box, choose 'verbd' POS and Hayat97.
      • Check out the word forms. Note that you are getting most of the various stems of the verb dcA ydcw.
      • Try the other types of defectives: bn,ai and lq,ia and tlq,aa
      • The POS 'verbd' expects you to type a stem (without the long vowel at the end), a comma, and then the vowel pattern (perfect then imperfect).
      • The following vowel patterns are recognized:
        • au (for verbs like dcA ydcw)
        • ai (for verbs like bne ybny and all Form VIII and X passives; it can also be used for passive defective verbs, which are actually ui)
        • ia (for verbs like lqy ylqe)
        • aa (for defective Form V and VI verbs like tlqe ytlqe)
      • There are several reasons why the results of searches like these are less useful and accurate than the other kinds of verb searches, but it is something.
      • You can always use 'string' and search for particular forms, or use regular expressions to define exactly what you want to get.
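As a rough illustration of the stem-plus-vowel-pattern input described above, the sketch below rebuilds the 'he' perfect and 'he' imperfect spellings from the four patterns listed. The final letters are taken from the examples in the list; the function name `citation_forms` is hypothetical, and the real 'verbd' search presumably checks many more forms than these two.

```python
# Final letters for (perfect, imperfect), read off the examples above:
# dcA ydcw, bne ybny, lqy ylqe, tlqe ytlqe.
ENDINGS = {
    "au": ("A", "w"),
    "ai": ("e", "y"),
    "ia": ("y", "e"),
    "aa": ("e", "e"),
}


def citation_forms(spec):
    """Turn 'stem,pattern' (e.g. 'dc,au') into the two 'he' forms (hypothetical helper)."""
    stem, pattern = spec.split(",")
    perfect_end, imperfect_end = ENDINGS[pattern]
    return stem + perfect_end, "y" + stem + imperfect_end


for spec in ["dc,au", "bn,ai", "lq,ia", "tlq,aa"]:
    print(spec, "->", citation_forms(spec))
```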
    • Search for an abstract form
      • Note: Before trying these examples, remember that when a search produces more than 50,000 hits, it will sometimes balk and not return anything. When you are searching for patterns that are common, first choose a very small corpus to try it out on.
      • Type m\wA\w\wQ into the leftmost box, choose noun POS, and the corpus Smell of the Sea (a short novel).
      • This is the pattern for a form III Verbal noun, and so should find every form III VN in the text.
      • \w stands for any Arabic character, so m\wA\w\wQ means something like mfAclQ (where fcl stands for any root).
      • Check out the word forms and see the most common Form III verbal nouns in the novel.
      • Now try the same search on the Ahram99 corpus.
      • If you were able to get it to go through without balking, and if you had enough patience, you would discover under word forms:
        • that mbArAQ and mpArkQ are the most common Form III verbal nouns in the Ahram
        • that there are 3208 different Form III verbal noun word forms used in the Ahram
        • that there are some words that fit our pattern but that are not Form III verbal nouns
      • Now try finding some other abstract forms (go back to using Smell of the Sea or another short corpus):
        • Type t\w\wy\w into the leftmost box, choose 'noun' POS, and Smell of the Sea (Form II verbal nouns).
        • Type s[ytnLA]\w\w\w\w? into the leftmost box, choose 'adv' POS, and Smell of the Sea (most future singular verbs with sa- but without pronoun endings--note there are many false hits).
        • Type \w\w\w+Q into the leftmost box, choose 'adv' POS, and Smell of the Sea (all feminine nouns at least 4 letters long).
        • Type y\w\w[aui]\w into the leftmost box, choose 'adv' POS, and Hayat97 and click the include vowels button (all four letter words beginning with y, mostly verbs, that have a vowel marked on the next to last letter and no other vowels, or three letter words beginning with y with one other vowel marked).
        • Type \w[aui]\w[aui]\w into the leftmost box, choose 'adv' POS, and Hayat97 and click the include vowels button (all three letter words with vowels marked on the first and second letter).
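To see how the abstract patterns above behave before trying them on a large corpus, they can be tested offline with Python's `re` module. This is a sketch on a made-up line of transliterated text. One assumption to keep in mind: in Python, \w also matches digits and the underscore, whereas the site describes \w as any Arabic character.

```python
import re
from collections import Counter

# Toy transliterated text, invented for illustration only.
text = "mbArAQ mpArkQ mdrsQ mbArAQ tdrys mHADrQ mktb"

form3_vn = re.compile(r"\bm\wA\w\wQ\b")   # Form III verbal noun shape, like mfAclQ
form2_vn = re.compile(r"\bt\w\wy\w\b")    # Form II verbal noun shape

form3_counts = Counter(form3_vn.findall(text))
form2_hits = form2_vn.findall(text)

print(form3_counts.most_common())
print(form2_hits)
```

The word mdrsQ is skipped because it lacks the alif in third position. Any word that merely happens to fit the shape would still be counted, which is the false-hit problem noted above.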
    • Search for a phrase
      • Type lA gbAr into the leftmost box, choose 'adv' POS, and Hayat97.
      • Check out some of the citations.
      • Try these other phrases:
        • Type (Al)?bnyQ (Al)?tHtyQ into the leftmost box, choose 'string' POS and Hayat97.
        • Type pyk bdwn rSyd into the leftmost box, choose 'string' POS and Hayat97 (one hit only, check out the citation)
        • Type ttHlb lh Al[LA]fwAh into the leftmost box, choose 'string' POS and All (one hit only, check out the citation)
        • Type \bwzArQ Al\w+ into the leftmost box, choose 'string' POS and Treebank (to see a list of different ministries)
        • Try AlwzArQ Al\w+ (obviously not as useful)
        • Type mA lbV [LA]n into the leftmost box, choose 'string' POS and Treebank.
        • Try (mA lbV|lm ylbV) [LA]n with 'string' POS and Hayat97 (be prepared to wait)
      • You are basically limited only by your imagination. However, if you want to be sure to get the results you want, remember to account for hamza variations and other spelling variations.
      • Also remember that some writers do not put spaces where others do (some write lAbd and others lA bd). You can capture this with lA ?bd (try it with Treebank)
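The optional pieces in the phrase searches above can be checked with a short Python sketch. The sample sentence is made up, and Python's `re` module is assumed to behave like the site's regular expressions for these simple constructs.

```python
import re

# Invented sample text containing both spellings of each phrase.
text = "lAbd mn AlbnyQ AltHtyQ w lA bd mn bnyQ tHtyQ"

# ' ?' makes the space optional, so both lAbd and lA bd are found.
spacing = [m.group(0) for m in re.finditer(r"lA ?bd", text)]

# '(Al)?' makes the definite article optional on each word.
article = [m.group(0) for m in re.finditer(r"(Al)?bnyQ (Al)?tHtyQ", text)]

print(spacing)
print(article)
```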
    • Compare results of different corpora
      • Type hAtf into the leftmost box, choose 'noun' POS and Newspapers.
      • Go to subsections and see what papers use it more commonly than others.
      • Now type tly?fwn into the leftmost box, choose 'noun' POS and Newspapers.
      • Compare the subsection results with the ones you got for hAtf.
      • Type lqd into the leftmost box, choose 'adv' POS and All.
      • Note in subsections that the Quran and the novel have a much higher rate of use than the other corpora, and that the Ahram uses it more frequently than the Hayat.
      • Type shl into the leftmost box, choose 'adj' POS and Hayat97.
      • Look in subsections and see which parts of the paper like this adjective more than others.
      • Then try it with Ahram99 and see if the same pattern holds. Look at the words before and words after page to see if you get any hints as to why this may be so.
      • Try the same thing with hjwm (noun) and see if the subsection patterns make sense to you.
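Comparing corpora of different sizes only makes sense with a normalized rate rather than a raw hit count. The sketch below shows the arithmetic on two tiny made-up corpora. The names `paper_a` and `paper_b`, the texts, and the per-1,000-words scale are all assumptions for illustration; the site may normalize differently.

```python
import re

# Two tiny invented "corpora"; real subsections are far larger.
toy_corpora = {
    "paper_a": "hAtf jdyd w hAtf qdym fy Albyt",
    "paper_b": "tlyfwn jdyd fy Albyt w tlfwn qdym",
}


def rate_per_1000(pattern, text):
    """Hits per 1,000 whitespace-separated words (arbitrary scale)."""
    hits = len(re.findall(pattern, text))
    return 1000 * hits / len(text.split())


hAtf_rates = {name: rate_per_1000(r"\bhAtf\b", t) for name, t in toy_corpora.items()}
tlfwn_rates = {name: rate_per_1000(r"\btly?fwn\b", t) for name, t in toy_corpora.items()}

print(hAtf_rates)
print(tlfwn_rates)
```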
    Questions/Problems (Click question to see answer)
    • Why do I want to filter my results?
      • If you are trying to find examples of a particular word or construction, and the program finds hundreds of results of something else that happens to match what you are looking for (because of the morphological ambiguity of Arabic), it can be very time consuming and annoying to search through the citations one by one looking for the ones you really want. If you can figure out a way to filter out the 'bad' ones, you can save yourself many hours.
    • You want to take care of filtering yourself, and bypass the POS filters.
      • Choose 'string' and type a regular expression that matches what the POS filters do, or which varies them.
      • \bgbAr\b will find ONLY the word gbAr with no prefixes or suffixes.
      • \b[wf]?gbAr\b allows gbAr, wgbAr and fgbAr and nothing else.
      • \b[wf]?(Al)?gbAr\b allows gbAr, wgbAr and fgbAr with and without the definite article.
      • \b[wf]?(Al)?gbAr(h|hA|k|y)?\b allows all of the above and the singular pronoun endings.
      • \b[ytnLA]ktb[wyA]?[nA]?\b is one way to find present tense verb forms, here without any other prefixes or suffixes.
      • The program can deal with quite complex expressions efficiently, as long as parentheses are matched, so you can get quite a bit of control over what you are searching for if you want it.
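The hand-built filters above can be tried out in Python before running them on the site. This sketch tests two of the expressions against an invented word list; it assumes the site's \b and character classes behave as they do in Python's `re` module.

```python
import re

noun_filter = re.compile(r"\b[wf]?(Al)?gbAr(h|hA|k|y)?\b")
present_filter = re.compile(r"\b[ytnLA]ktb[wyA]?[nA]?\b")

# Invented candidate forms: some should pass the filter, some should not.
noun_candidates = ["gbAr", "wgbAr", "fgbAr", "AlgbAr", "wAlgbAr",
                   "gbArh", "gbArhA", "bgbAr", "gbArhm"]
verb_candidates = ["yktb", "tktbwn", "yktbAn", "nktb", "ktb", "wyktb"]

kept_nouns = [w for w in noun_candidates if noun_filter.fullmatch(w)]
kept_verbs = [w for w in verb_candidates if present_filter.fullmatch(w)]

print(kept_nouns)
print(kept_verbs)
```

The forms bgbAr and gbArhm are rejected because the prefix b- and the plural ending -hm are not listed in the expression; adding them is a matter of extending the bracket or the alternation.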
    • You want to find all examples of the verb twj, but looking at the results you realize that you have many instances of twjh, which could conceivably mean 'he crowned him' but which 99% of the time is in fact another verb: tawajjaha 'to head for'.
      • Choose 'verb' and search for twj -- h. This will delete all the twjh's. You may miss a couple of 'he crowned him's' but it's worth it because you can now see that your results are mainly what you want and expected.
    • You want to find all examples of the verb qAl, including passives.
      • Searching for qAl,ql,qwl,ql with 'verb4' will give you all the actives. Searching for qyl,ql,qAl,ql will give you all the passives. Searching for q[Ay]l,ql,q[wA]l,ql will give you both all sorted together. Try this on a smaller corpus like the Treebank first, since it generates a huge number of hits.
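The bracketed stems q[Ay]l and q[wA]l work because the active and passive long stems differ in a single letter. A minimal Python sketch of that idea follows. It handles only the long stems without suffixes, and the function name `voice` is hypothetical.

```python
import re

# Long stems only, no suffixes: a deliberately narrow illustration.
PERFECT = re.compile(r"q([Ay])l")              # qAl (active) or qyl (passive)
IMPERFECT = re.compile(r"[ytnLA]q([wA])l")     # yqwl (active) or yqAl (passive)


def voice(form):
    """Label a long-stem form of qAl as 'active' or 'passive', else None."""
    m = PERFECT.fullmatch(form)
    if m:
        return "active" if m.group(1) == "A" else "passive"
    m = IMPERFECT.fullmatch(form)
    if m:
        return "active" if m.group(1) == "w" else "passive"
    return None


for form in ["qAl", "qyl", "yqwl", "yqAl", "ktb"]:
    print(form, voice(form))
```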
    • You want to find all examples of the verb Lyd ('to support').
      • Searching for Lyd using 'verb' will miss all the imperfects and all the perfects where the hamza was not typed on the alif. Searching for [LA]yd will get the hamza/alif variation, but will still miss the imperfects. To get those you need to choose 'verb2' instead and type: [LA]yd,Wyd i.e. the perfect stem and the imperfect stem with a comma in between. This should give you what you are looking for.
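The two-stem search above can be approximated in Python to see which spellings it would catch. The affix lists here are simplified assumptions rather than the program's own, and `verb2_pattern` is a hypothetical name.

```python
import re

# Simplified affix lists, assumed for illustration only.
PERF_SUFFIXES = ["", "t", "A", "tA", "wA", "nA", "tm", "tn", "n"]
IMPF_SUFFIXES = ["", "wn", "wA", "yn", "y", "An", "A", "n"]


def verb2_pattern(spec):
    """Turn 'perfect_stem,imperfect_stem' into one compiled regex (hypothetical helper)."""
    perfect, imperfect = spec.split(",")
    perf = perfect + "(?:" + "|".join(PERF_SUFFIXES) + ")"
    impf = "[ytnLA]" + imperfect + "(?:" + "|".join(IMPF_SUFFIXES) + ")"
    return re.compile(r"\b(?:" + perf + "|" + impf + r")\b")


pat = verb2_pattern("[LA]yd,Wyd")
forms = ["Lyd", "Ayd", "Aydt", "yWyd", "tWydwn", "Wyd", "yLyd"]
found = [f for f in forms if pat.fullmatch(f)]
print(found)
```

The class [LA] covers the perfect with and without the hamza typed on the alif, while the second stem picks up the imperfect spelling with W.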
    • You want to get comparative statistics on use of the verb forms yumkinu and tumkinu when they have a feminine verbal noun subject, but when you search for tmkn (using 'adv') you get overwhelmed with things that are really tamakkana, tamakkun, tumakkinu and the like.
      • This one is a good example of the inherent ambiguity of much of Arabic morphology, particularly graphemically. To get a statistic you can rely on, particularly if you don't have the stomach to actually read through hundreds of citations, you probably need to do what I call a 'limited search'. Instead of looking for all instances of tmkn, do a search for [yt]mkn \w+(Q|t(h|hA|k|hm|hmA|nA)), which will give you a list of all the words ending in a taa' marbuuta, or in a taa' followed by a pronoun ending, that come after one of these verbs. Then look at the list of words before and after. Look down the list of words after and pick out (say) the top 25 (or 10 or 5) feminine verbal nouns that come directly after ymkn or tmkn, and make sure they 'feel' right (i.e. you are pretty sure that the nouns you are choosing really are likely to function as the subjects of these verbs). Then do a search for those forms only, once with ymkn before, and once with tmkn. You don't have an overall statistic, but you have a 'subset' statistic that you can trust. Sample search strings with 5 verbal nouns I found would be:
      • ymkn (AstfAd|tsmy|syTr|zyAd|mwAjh)(Q|t(h|hA|k|y|hm|hn|km|kn|nA|hmA|kmA))
      • tmkn (AstfAd|tsmy|syTr|zyAd|mwAjh)(Q|t(h|hA|k|y|hm|hn|km|kn|nA|hmA|kmA))
      • Comparing the totals you get on those two searches should give you a good start at figuring out the relative frequency.
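Typing long search strings like the two above by hand is error-prone, so it can help to generate them from a list. The sketch below rebuilds the same two strings and counts matches in a made-up sentence. The helper name `limited_search` and the sample text are assumptions for illustration.

```python
import re

NOUN_STEMS = ["AstfAd", "tsmy", "syTr", "zyAd", "mwAjh"]
PRONOUNS = ["h", "hA", "k", "y", "hm", "hn", "km", "kn", "nA", "hmA", "kmA"]


def limited_search(verb, stems=NOUN_STEMS, pronouns=PRONOUNS):
    """Build 'verb (stem1|stem2|...)(Q|t(pronoun1|...))' (hypothetical helper)."""
    return f"{verb} ({'|'.join(stems)})(Q|t({'|'.join(pronouns)}))"


ymkn_search = limited_search("ymkn")
tmkn_search = limited_search("tmkn")

# Invented sample text, only to show the comparison step.
text = "ymkn AstfAdQ kbyrQ w tmkn zyAdth w ymkn mwAjhQ"
ymkn_count = sum(1 for _ in re.finditer(ymkn_search, text))
tmkn_count = sum(1 for _ in re.finditer(tmkn_search, text))

print(ymkn_search)
print(ymkn_count, tmkn_count)
```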
    site maintained by d. parkinson. contact him with any problems or suggestions.